[Air-l] SW to store pages - scraping and robots

Dan L Burk burkx006 at umn.edu
Tue Jun 7 13:34:26 PDT 2005


On 5 Jun 2005, Thomas Koenig wrote:
> Bernie Hogan wrote:
>
> >Be careful when you scrape. Check the robots.txt file at the domain level,
> >for example http://www.google.com/robots.txt. If you aren't allowed to
> >spider it, then perhaps you need some sort of ethics approval to capture it
> >for academic purposes [if not, I feel you should require this approval, and
> >to open a can of worms - I think the AoIR guidelines should reflect this].
> 
> 
> wget actually has an option to ignore "robots.txt" and even an option to
> pose as IE or any other browser for that matter (as do HTTrack and
> WebCopier ;-) ). I personally wouldn't have any problem activating that
> option. But that's a political decision I feel is better decided by
> national or supranational polities than by a voluntary association such
> as AoIR.
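
[As an aside, the wget switches Thomas means are, if I recall them
correctly, "-e robots=off" and "--user-agent=...". Below is a minimal
Python sketch of the opposite, more candid practice -- declaring an honest
User-Agent rather than posing as IE. The URL and agent string are
placeholders of my own, not anything prescribed in this thread:

    import urllib.request

    # Placeholder target and agent string -- substitute your own.
    URL = "http://www.example.com/page.html"
    AGENT = "AcademicScraper/0.1 (research use; contact: you at example.edu)"

    # Declare who we are instead of masquerading as a browser.
    req = urllib.request.Request(URL, headers={"User-Agent": AGENT})
    with urllib.request.urlopen(req) as resp:
        page = resp.read()
    print("fetched %d bytes from %s" % (len(page), URL))

]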

Those who attended Gove Allen's presentation on academic use of automated
data-retrieval 'bots at the Toronto conference will recall that this is
both a legal and an ethical problem.  Gove, Charles Ess, Gordon Davis, and
I have papers forthcoming on both aspects.  Given Charles' involvement in
the project, I know there is keen interest in having the AoIR guidelines
address at least the ethical side of the problem at some point, probably
not too far in the future.

Oh, and don't just check the robots.txt file -- it's probably better to
check any written TOS pages as well.  The machine-readable and the
human-readable prohibitions aren't always congruent.
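
For the machine-readable side, here is a minimal sketch of checking
robots.txt programmatically before scraping, using Python's standard
urllib.robotparser. The Google URL is just Bernie's example above, and the
agent string is a placeholder:

    import urllib.robotparser

    AGENT = "AcademicScraper/0.1"   # placeholder agent string

    # Parse the site's robots.txt once...
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.google.com/robots.txt")
    rp.read()

    # ...then ask whether this agent may retrieve a given path
    # before actually fetching it.
    for path in ("http://www.google.com/search?q=test",
                 "http://www.google.com/index.html"):
        verdict = "allowed" if rp.can_fetch(AGENT, path) else "disallowed"
        print(path, "->", verdict)

The human-readable TOS, of course, still has to be read by a human.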

Dan L. Burk
Visiting Professor
Cornell Law School
Myron Taylor Hall
Ithaca, NY 14853 USA

Oppenheimer, Wolff & Donnelly Professor
University of Minnesota Law School
229 19th Avenue South
Minneapolis, MN 55455 USA
***************************************
Voice: 612-626-8726
Fax: 612-625-2011
