[Air-l] SW to store pages - scraping and robots
Thomas Koenig
T.Koenig at lboro.ac.uk
Sun Jun 5 20:20:12 PDT 2005
Bernie Hogan wrote:
>I admit I haven't used much of the software mentioned above, although I have
>substantial experience with spidering and scripting. 200 protest group
>websites is a significant amount to archive even with wget.
>
>
It really depends on the web presence of these sites. Back in 1998, I
captured 500 (!) New Age websites (with, err, wget ;-) ) and, zipped, they
all fit on 50 floppy disks (that's about 70MB!!). Granted, things have
changed since then, but it might still be possible to do a reasonable
job, given how poor many movement sites are (I assume NOW, attac or some
other professional SMOs are not part of the sample ;-) ).
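Just to make the batch-capture idea concrete, here is a rough sketch of the
sort of job I mean (the URL list file, the options and the Python wrapper are
illustrative only, not what I actually ran in 1998):

import subprocess

# hypothetical list of sites, one URL per line
with open("site_list.txt") as f:
    sites = [line.strip() for line in f if line.strip()]

for url in sites:
    subprocess.run([
        "wget",
        "--mirror",              # recursive retrieval with timestamping
        "--convert-links",       # rewrite links for offline browsing
        "--page-requisites",     # fetch images/CSS needed to render pages
        "--wait=1",              # be polite: pause between requests
        "--directory-prefix=archive",
        url,
    ], check=False)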
[ack. snip]
>Be careful when you scrape. Check the robots.txt file at the domain level,
>for example http://www.google.com/robots.txt. If you aren't allowed to
>spider it, then perhaps you need some sort of ethics approval to capture it
>for academic purposes [if not, I feel you should require this approval, and,
>to open a can of worms, I think the AoIR guidelines should reflect this].
>
>
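For those who do want to honour it, checking robots.txt is easy enough to
script. A minimal sketch using Python's standard-library robot parser
(urllib.robotparser in Python 3); the spider name is just a placeholder:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.google.com/robots.txt")
rp.read()

# True if this agent may fetch the given URL according to robots.txt
print(rp.can_fetch("MyResearchSpider", "http://www.google.com/search"))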
wget actually has an option to ignore "robots.txt" and even an option to
pose as IE or any other browser for that matter (as do HTTrack and
WebCopier ;-) ). I personally wouldn't have any problem activating that
option. But that is a political decision I feel is better made by
national or supranational polities than by a voluntary association such
as AoIR.
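For concreteness, the relevant wget switches are -e robots=off and
--user-agent; wrapped in the same kind of Python call as above, with the
target URL and the browser string as placeholders:

import subprocess

subprocess.run([
    "wget",
    "--mirror",
    "-e", "robots=off",    # disregard robots.txt entirely
    "--user-agent=Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "http://www.example.org/",
], check=False)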
--
thomas koenig, ph.d.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/