[Air-L] Wikipedia article edit history extraction tools?

Brian Keegan bkeegan at northwestern.edu
Fri Aug 17 15:43:16 PDT 2012


Wikipedia has its own API which makes a variety of data acquisition tasks
trivial once you get the hang of it. Documentation for API:
http://en.wikipedia.org/w/api.php

For example, extracting the revision IDs, timestamps, users, user IDs, and
article size (rvprop=ids|timestamp|user|userid|size) for the 500
(rvlimit=500) most recent (rvdir=older) revisions (prop=revisions) of the
article "Feminism" (titles=Feminism) can be done with:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Feminism&rvprop=ids|timestamp|user|userid|size&rvlimit=500&rvdir=older

Adding the content of each revision
(rvprop=ids|timestamp|user|userid|size|content):
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Feminism&rvprop=ids|timestamp|user|userid|size|content&rvlimit=500&rvdir=older

These calls return the data as XML, JSON, or other structured formats,
which can be read into Excel, MySQL, etc. with some basic Python scripting.
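
For instance, a rough sketch of that kind of scripting using only the
Python 2.7 standard library (the output and User-Agent string here are just
illustrative; the query parameters mirror the first example URL above):

import json
import urllib
import urllib2

# Same query as the example URL above, but asking explicitly for JSON
params = {'action': 'query',
          'format': 'json',
          'prop': 'revisions',
          'titles': 'Feminism',
          'rvprop': 'ids|timestamp|user|userid|size',
          'rvlimit': '500',
          'rvdir': 'older'}
url = 'http://en.wikipedia.org/w/api.php?' + urllib.urlencode(params)
# Wikimedia asks clients to send a descriptive User-Agent
request = urllib2.Request(url, headers={'User-Agent': 'air-l-example-script/0.1'})
data = json.load(urllib2.urlopen(request))

# 'pages' is keyed by page id; this query only returns one page
page = data['query']['pages'].values()[0]
for rev in page['revisions']:
    print rev['timestamp'], rev.get('user', u'').encode('utf-8'), rev.get('size', 0)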

The API offers a variety of other options, such as returning the
categories an article belongs to (prop=categories), the other articles it
links out to (prop=links), the articles a given editor has contributed to
(list=usercontribs), and parsed prior versions of articles (action=parse),
e.g., to see what they previously linked to.
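
For instance, following the same pattern as the URLs above, the categories
of the article and the recent contributions of a particular editor
(substitute whatever user name you are interested in for ucuser):

http://en.wikipedia.org/w/api.php?action=query&prop=categories&titles=Feminism
http://en.wikipedia.org/w/api.php?action=query&list=usercontribs&ucuser=Jimbo_Wales&uclimit=500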

Analogous to the Twitter API, the Wikipedia API limits how many revisions
you can return in a single call (500). Libraries like python-wikitools
(http://code.google.com/p/python-wikitools/) will seamlessly handle the
repeated calls so that you can get the complete history.

Example Python 2.7.3 code using python-wikitools:

from wikitools import wiki, api

def do_complete_wikipedia_query(query_params):
    site = wiki.Wiki()  # This defaults to en.wikipedia.org
    request = api.APIRequest(site, query_params)
    result = request.query()
    return result[query_params['action']]

def get_article_revisions(article_name):
    '''
    Input:
    article_name - A string containing the title of an article as it would
        appear in the URL, e.g., "Northwestern University"
    Output:
    revisions - A list of revisions for the given article, each given as a
        dictionary. This includes all of the properties requested in rvprop,
        plus the title and page id of the source article.
    '''
    revisions = []
    try:
        result = do_complete_wikipedia_query({'titles': article_name,
                                              'prop': 'revisions',
                                              'rvprop': 'ids|timestamp|user|userid|size',
                                              'rvlimit': '500',
                                              'rvdir': 'newer',
                                              'action': 'query'})
        if result and 'pages' in result:
            page_number = result['pages'].keys()[0]
            r = result['pages'][page_number]['revisions']
            r = sorted(r, key=lambda revision: revision['timestamp'])
            for revision in r:
                revision['pageid'] = page_number
                revision['title'] = article_name
                # Sometimes the size key is not present, so set it to 0 in those cases
                revision['size'] = revision.get('size', 0)
                revisions.append(revision)
        print u"{0} revisions found in {1}".format(len(revisions), article_name)
    except api.APIError:
        print u"{0} does not exist".format(article_name)
    return revisions

feminism_revisions = get_article_revisions('Feminism')
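
Since the original question was about getting an article's history into a
spreadsheet, here is a rough follow-up sketch (again Python 2.7; the field
list and file name are just illustrative) that writes the returned revision
list to a CSV file Excel can open:

import csv

fields = ['pageid', 'title', 'revid', 'parentid', 'timestamp', 'user', 'userid', 'size']
f = open('feminism_revisions.csv', 'wb')
writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
writer.writeheader()
for rev in feminism_revisions:
    row = dict((k, rev.get(k, '')) for k in fields)
    if isinstance(row['user'], unicode):
        # Encode non-ASCII user names so the csv module doesn't choke
        row['user'] = row['user'].encode('utf-8')
    writer.writerow(row)
f.close()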

On Wed, Aug 15, 2012 at 3:00 PM, <air-l-request at listserv.aoir.org> wrote:
>
> Message: 1
> Date: Tue, 14 Aug 2012 15:38:53 -0700
> From: Monika Sengul-Jones <jones.monika at gmail.com>
> To: air-l at listserv.aoir.org
> Subject: [Air-L] Wikipedia article edit history extraction tools?
> Message-ID:
>         <CAD_EfP+5jdaG8MDPfGMSwMO=
> vCifMU_bR8+7MPvDUF5Yb2AVyQ at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hello Air-L list:
>
> This summer I'm doing research on Wikipedia entries in the field of Science
> and Technology Studies. A central question I'm asking is the extent to
> which this field, as it is now on Wikipedia, includes/features/references
> contributions made by women, feminist theorists, and feminist theory.
>
> To answer this, I'm gathering data on existing pages using a variety of
> mixed methods. I would like to ask for recommendations on tools for
> extracting the history of editing on a page. I want to see how many times a
> given article has been edited, by whom, and what types of edits and content
> contributions are made over time. So far, I've found the "history" tool on
> the Wikipedia page limited. I cannot see how many edits have been made on a
> particular article and understanding what kinds of edits are made (e.g.
> grammatical,  content) requires going into each historical page view. I'd
> love to find a way to download the history of an article and extract the
> data into a spreadsheet -- perhaps this is a tall order.
>
> So far, I've found tools for extracting data on Wikipedia from the Digital
> Methods Initiative website (which was first introduced to me by this list
> serve! :)). Specifically, the program History Flow is useful to an extent
> for visualizing types of content contributions and edits over time. But
> there is no way to translate these visualizations into a spreadsheet format
> -- as far as I can tell -- so I've been doing that manually, somehow
> piecing together the history of edits on an article. Meanwhile, I was
> recommended a tool called WikiChecker (
> http://en.wikichecker.com/article/?a=science_studies) but the summary
> format is limited and, at times, contradictory to data I get elsewhere.
>
> If anyone has any other tools or methods to suggest for ways to collect
> data on content contributions and edits on Wikipedia I would be most
> grateful.
>
> I'd also be happy to be in conversation with anyone interested in the
> concept of the project. I'm working on it as a part of the FemTechNet
> Initiative, spearheaded by Anne Balsamo and Alexandra Juhasz. I'm not sure
> if information on the initiative has circulated here, so I'll paste in a
> copy of the "call" which took place last spring.
>
> http://aljean.files.wordpress.com/2012/05/femtechnet-long-form-invite-may-2012.pdf
>
> Thank you,
> Monika
>
> --
> Monika Sengul-Jones
> Graduate Student
> Communication & Science Studies
> University of California, San Diego
> msengul at ucsd.edu
>
>

-- 
Brian C. Keegan
Ph.D. Student - Media, Technology, & Society
School of Communication, Northwestern University

Science of Networks in Communities, Laboratory for Collaborative Technology


