[Air-L] Social Media data fraught with biases and distortion
Elijah Wright
elijah.wright at gmail.com
Mon Dec 1 08:32:27 PST 2014
My reaction to that title is scorn and the impulse to point out that
all data has biases, and statistics is how you turn raw data into
insights about *the data*.
After a little more consideration, I think the press release is much
more reasonable than the title makes it out to be. Yes, people very
commonly assume, without saying so, that social media is a good
representation of reality. It ain't necessarily so, folks! :)
And a great big "Yes! That!" to inbuilt sampling bias and "black box"
views of available data - these are very real problems that can come
along with complexity and huge rivers of data. :)
(To reply to Peter Timusk's other reply to this post -- yes, lots of
people write bad papers. Not worth worrying about, really. It's a
shame, but... what can you do? Best thing is always to do our own,
superlative work, and just be model citizens of the ecosystem :) )
best,
--e
On Sun, Nov 30, 2014 at 1:35 PM, Katja Mayer <katja.mayer at univie.ac.at> wrote:
> FYI:
> http://www.sciencemag.org/content/346/6213/1063.summary
>
> From the press release:
> Using Social Media For Large Behavioral Studies Is Fast and Cheap, But
> Fraught With Biases and Distortion
>
> PITTSBURGH—The rise of social media has seemed like a bonanza for behavioral
> scientists, who have eagerly tapped the social nets to quickly and cheaply
> gather huge amounts of data about what people are thinking and doing. But
> computer scientists at Carnegie Mellon University and McGill University warn
> that those massive datasets may be misleading.
>
> In a perspective article published in the Nov. 28 issue of the journal
> Science, Carnegie Mellon’s Juergen Pfeffer and McGill’s Derek Ruths contend
> that scientists need to find ways of correcting for the biases inherent in
> the information gathered from Twitter and other social media, or to at least
> acknowledge the shortcomings of that data.
>
> And it’s not an insignificant problem; Pfeffer, an assistant research
> professor in CMU’s Institute for Software Research, and Ruths, an assistant
> professor of computer science at McGill, note that thousands of research
> papers each year are now based on data gleaned from social media, a source
> of data that barely existed even five years ago.
>
> “Not everything that can be labeled as ‘Big Data’ is automatically great,”
> Pfeffer said. He noted that many researchers think — or hope — that if they
> gather a large enough dataset they can overcome any biases or distortion
> that might lurk there. “But the old adage of behavioral research still
> applies: Know Your Data,” he maintained.
>
> Still, social media is a source of data that is hard to resist. “People want
> to say something about what’s happening in the world and social media is a
> quick way to tap into that,” Pfeffer said. Following the Boston Marathon
> bombing in 2013, for instance, Pfeffer collected 25 million related tweets
> in just two weeks. “You get the behavior of millions of people — for free.”
>
> The type of questions that researchers can now tackle can be compelling.
> Want to know how people perceive e-cigarettes? How people communicate their
> anxieties about diabetes? Whether the Arab Spring protests could have been
> predicted? Social media is a ready source for information about those
> questions and more.
>
> But despite researchers’ attempts to generalize their study results to a
> broad population, social media sites often have substantial population
> biases; generating the random samples that give surveys their power to
> accurately reflect attitudes and behavior is problematic. Instagram, for
> instance, has special appeal to adults between the ages of 18 and 29,
> African-Americans, Latinos, women and urban dwellers, while Pinterest is
> dominated by women between the ages of 25 and 34 with average household
> incomes of $100,000. Yet Ruths and Pfeffer said researchers seldom
> acknowledge, much less correct, these built-in sampling biases.
>
> Other questions about data sampling may never be resolved because social
> media sites use proprietary algorithms to create or filter their data
> streams and those algorithms are subject to change without warning. Most
> researchers are left in the dark, though others with special relationships
> to the sites may get a look at the site’s inner workings. The rise of these
> “embedded researchers,” Ruths and Pfeffer said, in turn is creating a
> divided social media research community.
>
> As anyone who has used social media can attest, not all “people” on these
> sites are even people. Some are professional writers or public relations
> representatives who post on behalf of celebrities or corporations; others
> are simply phantom accounts. Some “followers” can be bought. The social
> media sites try to hunt down and eliminate such bogus accounts — half of all
> Twitter accounts created in 2013 have already been deleted — but a lone
> researcher may have difficulty detecting those accounts within a dataset.
>
> “Most people doing real social science are aware of these issues,” said
> Pfeffer, who noted that some solutions may come from applying existing
> techniques already developed in such fields as epidemiology, statistics and
> machine learning. In other cases, scientists will need to develop new
> techniques for managing analytic bias.