[Air-L] Using the Archive.org for data capture?

Jefferson Bailey jefferson at archive.org
Mon Apr 20 16:29:34 PDT 2015


Hi all,

I am also happy to discuss more off list. We have worked with Matt and many others on providing access to web archive data for researchers interested in studying/mining the historical web. There are a number of initiatives within Internet Archive to augment research services and access models and I'll document them on this list as they ramp up. We're definitely excited to see more researchers interested in web archives.

Web archives do differ technically and methodologically from other sources on issues such as provenance, validity, format, and completeness, though many of these concerns are no different from those encountered with traditional (analog) archival materials; they often just *feel* more immediate or loaded given our unique relationship with the web and its affordances and contingencies as a quote-unquote documentary record.

Cheers,
Jefferson

Jefferson Bailey
Program Manager & Interim Co-Director
Web Archiving Services & Programs
Internet Archive
jefferson at archive.org


On Apr 20, 2015, at 13:54, Matthew Weber <matthew.weber at rutgers.edu> wrote:

> Dan,
> 
> Rogers’ digital methods work is a broad starting point, although I’m not sure that he’s specifically addressed issues with the Internet Archive. 
> 
> I’ve been working on research derived from the Internet Archive for almost a decade now, mostly at a large scale, although some projects are smaller in nature. One starting point might be this paper http://dl.acm.org/citation.cfm?id=2579213 and I have some other published work using derived datasets.
> 
> With regard to your question about validity, it depends in part on what you’re looking to explore. If you’re using smaller datasets, validity won’t be too much of an issue, but once you scale beyond a few dozen domains (and again, depending on your analysis and RQs) there are validity issues that must be addressed. We’ve started to outline these in a few related papers that are under review, but mostly they pertain to sampling error and data completeness.
> 
> Feel free to ping me offline - I can point you to GitHub code and other work, depending on your goals - and definitely check out the work of others. Kalev Leetaru is active on here and works in this space, as does Niels Brügger at Aarhus. There is a growing community of researchers doing Internet Archive-related research.
> 
> Regards,
> Matt
> 
> 
> 
> 
> 
>> On Apr 20, 2015, at 4:46 PM, Matthew T Mccarthy <mccart74 at uwm.edu> wrote:
>> 
>> Apologies for the curt message. I hit send before finishing. 
>> In addition to the citation for his book, here is a link to the Digital Methods Initiative wiki page
>> 
>> https://wiki.digitalmethods.net
>> 
>> Best, 
>> Matt
>> 
>> 
>> Matthew T. McCarthy
>> Ph.D. Student/Graduate Instructor
>> Department of Sociology
>> University of Wisconsin-Milwaukee
>> P.O. Box 413
>> Milwaukee, WI   53201
>> 
>> ________________________________________
>> From: Air-L <air-l-bounces at listserv.aoir.org> on behalf of Matthew T Mccarthy <mccart74 at uwm.edu>
>> Sent: Monday, April 20, 2015 3:43 PM
>> To: Dan Fielding; Air-L at listserv.aoir.org
>> Subject: Re: [Air-L] Using the Archive.org for data capture?
>> 
>> Dan,
>> 
>> Richard Rogers of the Digital Methods Initiative has dealt with this.
>> 
>> 
>> Rogers, R. (2013). Digital methods. MIT Press.
>> 
>> 
>> Matthew T. McCarthy
>> Ph.D. Student/Graduate Instructor
>> Department of Sociology
>> University of Wisconsin-Milwaukee
>> P.O. Box 413
>> Milwaukee, WI   53201
>> 
>> ________________________________________
>> From: Air-L <air-l-bounces at listserv.aoir.org> on behalf of Dan Fielding <sociologyfornerds at gmail.com>
>> Sent: Monday, April 20, 2015 3:28 PM
>> To: Air-L at listserv.aoir.org
>> Subject: [Air-L] Using the Archive.org for data capture?
>> 
>> Hello wonderful list,
>> 
>> I am currently establishing a research protocol that will rely on the
>> wayback machine (archive.org) to gather caches of pages from 1-2 years ago.
>> Is there research on the wayback machine as an effective mode of data
>> capture? Are there any questions about its validity? Have you read
>> published work using the wayback machine? What concerns have other scholars
>> raised about using it?
>> 
>> Thanks for your time! Have a great day,
>> 
>> Dan Fielding
>> _______________________________________________
>> The Air-L at listserv.aoir.org mailing list
>> is provided by the Association of Internet Researchers http://aoir.org
>> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>> 
>> Join the Association of Internet Researchers:
>> http://www.aoir.org/
> 


