[Air-l] SW to store pages - scraping and robots

Bernie Hogan bernie.hogan at utoronto.ca
Sun Jun 5 19:47:35 PDT 2005


I admit I haven't used much of the software mentioned above, although I have
substantial experience with spidering and scripting. 200 protest group
websites is a significant amount to archive even with wget.

If you are serious about archiving such a massive amount of content, I would
recommend getting acquainted with some heavy-duty tools, such as Perl or
Python. These are generally easier to access on a Mac or GNU/Linux box
than on a Windows one, but of course you can code these things under Windows.
It's also the case that such tools can drive wget, so that the downloading
itself is automated at a higher level.
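
A minimal sketch of what driving wget from Python could look like, assuming
wget is installed and on the PATH. The function names, the "archive"
directory, and the one-second wait are illustrative choices, not something
from the thread; the wget options used are standard ones.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlparse


def build_wget_command(url, dest_root):
    """Return the wget argument list for mirroring one site (hypothetical helper)."""
    host = urlparse(url).netloc or "unknown-host"
    target = Path(dest_root) / host  # one directory per site
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so pages work offline
        "--page-requisites",   # also fetch images and stylesheets
        "--no-parent",         # do not climb above the start URL
        "--wait=1",            # pause between requests (assumed value)
        f"--directory-prefix={target}",
        url,
    ]


def mirror_sites(urls, dest_root="archive", runner=subprocess.run):
    """Run wget once per URL and return the URLs whose run failed."""
    failed = []
    for url in urls:
        result = runner(build_wget_command(url, dest_root))
        if result.returncode != 0:
            failed.append(url)
    return failed


# Example call (would start real downloads, so it is left commented out):
# failed = mirror_sites(["http://example.org/", "http://example.net/"])
```

With 200 sites, the list of URLs would more likely be read from a text file,
and the returned list of failures gives you something to retry later.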

If you are using a Mac, Automator (which is free if you have the latest
OS*) can actually get a lot of this done for you if you set it up
correctly. 

My best recommendation right off is to purchase the book 'Spidering Hacks',
published by O'Reilly. Most of the scripts are written in Perl, but some are
in Python (which is generally understood to be more readable and beginner
friendly). 

Be careful when you scrape. Check the robots.txt file at the domain level,
for example http://www.google.com/robots.txt. If you aren't allowed to
spider it, then perhaps you need some sort of ethics approval to capture it
for academic purposes [if not, I feel you should require this approval, and
to open a can of worms - I think the AoIR guidelines should reflect this].
If a site doesn't want you to scrape it (as indicated in the robots.txt),
you might consider actually contacting these people and maybe even asking to
host a mirror (which would be ideal, and respectful). In return for
mirroring the site, of course you get your data.
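
A rough sketch of the robots.txt check in Python, using the standard
library's urllib.robotparser. The function names and the example.org rules
are made up for illustration; a real run would fetch each site's own
robots.txt, as check_live_site does.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url_for(site_url):
    """Build the domain-level robots.txt URL for any page on a site."""
    return urljoin(site_url, "/robots.txt")


def is_allowed(robots_text, page_url, user_agent="*"):
    """Check a page against robots.txt rules that are already in hand."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, page_url)


def check_live_site(site_url, user_agent="*"):
    """Fetch a site's robots.txt over the network and check the start URL."""
    parser = RobotFileParser()
    parser.set_url(robots_url_for(site_url))
    parser.read()  # performs the HTTP request
    return parser.can_fetch(user_agent, site_url)


# Illustrative rules, not taken from any real site
sample_rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(sample_rules, "http://example.org/index.html"))
print(is_allowed(sample_rules, "http://example.org/private/notes.html"))
```

A check like this only tells you what the site asks of automated visitors;
whether to contact the owners or seek ethics approval is still your call.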

Take Care,
BERNiE

*P.S. An addendum about the Mac's latest OS - Tiger does NOT run SPSS, so if
you depend on it (as I currently do :( ), be ready to switch to Stata or R for
quant work, as SPSS seems to be slack on their Mac development cycles.

Bernie Hogan
PhD Student
Department of Sociology
NetLab, Knowledge Media Design Institute
University of Toronto

I received a message from s.vicari at reading.ac.uk at approximately 6/5/05
8:59 AM. Above is my reply.

> Hi,
> 
> I am a PhD student at the University of Reading, UK. I am running a study on
> 200 protest group websites. Would you suggest any good SW to store whole
> websites offline?
> 
> Thanks a lot, at the moment I am a bit lost in links and buttons...
> ste
> 
> 
> 
> Stefania Vicari
> PhD student in Sociology
> University of Reading
> PO Box 218,
> Reading, RG6 6AA,
> United Kingdom.
