[Air-l] SW to store pages - scraping and robots
Thomas Koenig
T.Koenig at lboro.ac.uk
Sun Jun 5 20:20:12 PDT 2005
Bernie Hogan wrote:
>I admit I haven't used much of the software mentioned above, although I have
>substantial experience with spidering and scripting. 200 protest group
>websites is a significant amount to archive even with wget.
>
>
It really depends on the web presence of these sites. Back in 1998, I
captured 500 (!) New Age websites (with, err, wget ;-) ) and, zipped, they
all fit on 50 floppy disks (that's about 70MB!!). Granted, things have
changed since then, but it might still be possible to do a reasonable
job, given how poor many movement sites are (I assume NOW, attac or some
other professional SMOs are not part of the sample ;-) ).
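Just to make the batch-capture idea concrete, here is a rough sketch of the
sort of job I mean (the URL list file, the options and the Python wrapper are
illustrative only, not what I actually ran in 1998):

import subprocess

# hypothetical list of sites, one URL per line
with open("site_list.txt") as f:
    sites = [line.strip() for line in f if line.strip()]

for url in sites:
    subprocess.run([
        "wget",
        "--mirror",              # recursive retrieval with timestamping
        "--convert-links",       # rewrite links for offline browsing
        "--page-requisites",     # fetch images/CSS needed to render pages
        "--wait=1",              # be polite: pause between requests
        "--directory-prefix=archive",
        url,
    ], check=False)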
[ack. snip]
>Be careful when you scrape. Check the robots.txt file at the domain level,
>for example http://www.google.com/robots.txt. If you aren't allowed to
>spider it, then perhaps you need some sort of ethics approval to capture it
>for academic purposes [if not, I feel you should require this approval, and,
>to open a can of worms, I think the AoIR guidelines should reflect this].
>
>
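For those who do want to honour it, checking robots.txt is easy enough to
script. A minimal sketch using Python's standard-library robot parser
(urllib.robotparser in Python 3); the spider name is just a placeholder:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.google.com/robots.txt")
rp.read()

# True if this agent may fetch the given URL according to robots.txt
print(rp.can_fetch("MyResearchSpider", "http://www.google.com/search"))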
wget actually has an option to ignore "robots.txt" and even an option to
pose as IE or any other browser for that matter (as do HTTrack and
WebCopier ;-) ). I personally wouldn't have any problem activating that
option. But that is a political decision I feel is better made by
national or supranational polities than by a voluntary association such
as AoIR.
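For concreteness, the relevant wget switches are -e robots=off and
--user-agent; wrapped in the same kind of Python call as above, with the
target URL and the browser string as placeholders:

import subprocess

subprocess.run([
    "wget",
    "--mirror",
    "-e", "robots=off",    # disregard robots.txt entirely
    "--user-agent=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "http://www.example.org/",
], check=False)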
--
thomas koenig, ph.d.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/