[Air-L] experiences making large web archives datasets accessible for research?

Fri Sep 29 08:11:41 PDT 2017

As the landscape of copyright and fair use continues to evolve and as more
and more academic research relies on large datasets that are often, at
least in part, compiled from open web crawling, especially emerging areas
like neural network training datasets, it's interesting to think about how
the world's web archives might make more of their holdings available for
academic research. In the US, for example, most university IRBs I've
spoken with treat web-derived datasets as "exempt" regardless of the
sensitivity of the questions being asked and some institutions that have
additional pre-IRB data reviews appear to waive those in at least some
cases when it comes to web data (
https://www.forbes.com/sites/kalevleetaru/2017/09/16/ai-gaydar-and-how-the-future-of-ai-will-be-exempt-from-ethical-review/)
and thus web-crawled data is becoming especially popular.

While technical limitations do surface as concerns, the most common issue
I've heard from web archives regarding why they don't open their holdings
more broadly to data mining access revolves around copyright law and their
interpretations of fair use when it comes to academic data mining (and of
course the landscape of copyright and "fair use" exceptions varies
dramatically across the world).

I thought many on this list would be interested in a piece I put out
yesterday, in which I talk with Common Crawl about their approach to fair
use and their recommendations for web archives considering making their
holdings more accessible for data mining:

https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/

While the notion of just what counts as "fair use" or its equivalent is
obviously highly contested and varies from country to country (if it
exists at all in a form amenable to data mining), for a follow-on piece
I'm doing, I'd love to hear from anyone on this list who has released
similar large archives of web content for open research. In particular,
I'm interested in the legal justifications you used, your experiences, and
any adjustments you made to the collection that your counsel felt made the
fair use argument stronger. I'd also like to know whether you distributed
just the raw HTML or included imagery, etc., whether you just posted a
download link or required a signed researcher agreement first, and whether
you distributed the content to researchers' machines or required it to be
processed locally.

There are obviously a tremendous number of opinion pieces out there and
legal arguments and briefs provided by myriad organizations for and
against archives being able to box up large holdings of web pages and make
them available for data mining, so I'm particularly interested in
real-world examples of where groups have actually made large collections of
web pages available to others for data mining and how they accomplished
that and the considerations and concessions they made that they believe
ensured their work complied with fair use or its equivalent in their
country (rather than opinion pieces that just talk about how it can or
cannot and should or should not be done).

Feel free to respond to me off-list if you'd prefer.

Thanks so much!

Kalev