[Air-L] Wikipedia Sampling

Fri Sep 25 05:42:04 PDT 2015

For what it's worth, the machine learning company Lateral has used the raw
page view data (available back to 2007 at
http://dumps.wikimedia.org/other/pagecounts-raw) to produce what I think is
the kind of data set Alex is describing, i.e., a "most popular content on
Wikipedia" corpus. You can read more about their approach in their blog
post, "The Unknown Perils of Mining Wikipedia":
https://blog.lateral.io/2015/06/the-unknown-perils-of-mining-wikipedia
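
To make the idea concrete, here is a minimal sketch (not Lateral's actual
code) of turning raw page count records into a popularity ranking and a
traffic-weighted sample. It assumes each record is a space-separated line of
the form "project title views bytes", e.g. "en Main_Page 242332 4737756101".
The function names are hypothetical, and the weighted draw uses the
Efraimidis-Spirakis key trick, which is my choice rather than anything the
blog post prescribes.

```python
import random
from collections import Counter
from urllib.parse import unquote


def parse_line(line):
    """Parse one 'project title views bytes' record; None if malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    try:
        # Titles in the raw files are percent-encoded; decode for readability.
        return project, unquote(title), int(views)
    except ValueError:
        return None


def aggregate_views(lines, project="en"):
    """Sum view counts per title for one project code (assumed 'en')."""
    totals = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record[0] == project:
            totals[record[1]] += record[2]
    return totals


def weighted_sample(totals, k, seed=None):
    """Draw up to k distinct titles, favoring those with more views.

    Efraimidis-Spirakis: give each title the key u ** (1 / weight) with u
    uniform in [0, 1), then keep the k titles with the largest keys.
    """
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / views), title)
             for title, views in totals.items() if views > 0]
    keyed.sort(reverse=True)
    return [title for _, title in keyed[:k]]


# Tiny made-up input, standing in for lines read from hourly dump files.
example_lines = [
    "en Main_Page 100 5000",
    "en Albert_Einstein 40 2000",
    "en Main_Page 50 2500",
    "de Hauptseite 70 3000",
]
example_totals = aggregate_views(example_lines)
print(example_totals.most_common(2))
print(weighted_sample(example_totals, k=1, seed=0))
```

In practice you would stream many hourly files through aggregate_views
before ranking or sampling, and you would still need a separate step to
filter out non-article titles and bot-driven traffic.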

In particular, some of the technical details of how they worked with page
view data and content dumps, plus their consideration of how to handle
bot-created content (and the very idea of planning for it), might be of
interest to you. (If I understand correctly, bots are permitted on
Wikimedia sites if they are "harmless" and approved, but not all bots are
necessarily known, let alone evaluated.)

Have you also considered reaching out to the Wikimedia Research team
directly?
https://www.mediawiki.org/wiki/Wikimedia_Research/Research_and_Data

Cheers,

Cory Salveson

On Wed, Sep 23, 2015 at 12:23 PM, Alex Halavais <alex at halavais.net> wrote:

> Hi, Josh,
>
> It depends, of course, on what you are sampling *for*. A "constructed
> week" is generally based on viewing patterns, and so I suppose you
> could use traffic data to oversample the most popular pages. Or focus
> on the front page.
>
> The most obvious approach here is to just sample at random. In doing
> so, you will find a very large number of articles--some of them
> autogenerated/imported--that have never been touched.
>
> If you haven't, you might consider copying this question over here as well:
>
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
> In sum, though, any sampling method that draws on edit histories to
> study edit histories is probably a problem--the tail ends up wagging
> the dog a bit. I guess you could use this:
>
> https://aws.amazon.com/datasets/wikipedia-page-traffic-statistics/
>
> to sample based on visitors, but that's a dated collection. I'm sure
> getting the traffic data from somewhere is a possibility, but it seems
> like a lot of work to create a "constructed week."
>
> Best,
>
> Alex
>
>
> On Wed, Sep 23, 2015 at 8:33 AM, Joshua Braun <jabraun at journ.umass.edu>
> wrote:
> > Hi All,
> >
> > Just a brief question for the list: I'm considering doing a study that
> > looks at the edit histories of a sample of Wikipedia articles, and I'm
> > wondering if there are accepted strategies for assembling a
> > "representative" sample of Wikipedia articles akin to the way that, say,
> > television researchers put together a composite week for content analyses.
> >
> > Obviously any sampling strategy will come with limitations, upsides, and
> > downsides. I'm mostly curious as to whether there are accepted sampling
> > methods that have emerged in the literature dealing with Wikipedia.
> >
> > Thanks!
> >
> > All the Best,
> > Josh
> > --
> > Josh Braun, Ph.D.
> > Assistant Professor of Journalism Studies
> > Journalism Department
> > University of Massachusetts Amherst
> >
> > @josh_braun
> > Skype: wideaperture
> > http://wideaperture.net/
> >
> > "Maybe the only gift is a chance to inquire, to know nothing for
> > certain.  An inheritance of wonder and nothing more."
> > William Least Heat-Moon
> >
> > Sent from Emacs
> > _______________________________________________
> > The Air-L at listserv.aoir.org mailing list
> > is provided by the Association of Internet Researchers http://aoir.org
> > Subscribe, change options or unsubscribe at:
> > http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> >
> > Join the Association of Internet Researchers:
> > http://www.aoir.org/
>
>
>
> --
>
> // Alexander Halavais, Sociologist, Semiologist, and Saboteur Extraordinaire
> // Associate Professor of Social Technologies, Arizona State University
> // http://alex.halavais.net/bio     @halavais