[Air-l] Mapping the net with crawlers/robots

elijah wright elw at stderr.org
Wed Oct 20 08:59:11 PDT 2004



Most of this can be done pretty easily either with perl scripts (see the
WWW::Spyder module on CPAN,
http://search.cpan.org/~ashley/WWW-Spyder-0.18/Spyder.pm) or with shell
scripts and command-line flags to wget.  It is definitely not a
plug-in-and-go kind of task, however - you are going to have to invest
some serious time to get it working, and working *right*, for your data
collection needs.  A rough sketch of what that looks like follows.
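
A rough, untested sketch of that kind of crawl in perl - using
LWP::UserAgent and HTML::LinkExtor rather than WWW::Spyder itself; the
seed URL, depth limit, and user-agent string are just placeholders you
would swap for your own:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my @seeds     = ('http://www.example.org/');   # placeholder start URLs
my $max_depth = 2;                             # how many levels to follow
my $ua        = LWP::UserAgent->new(agent => 'research-crawler/0.1');

my %seen;
my @queue = map { [ $_, 0 ] } @seeds;

while (my $item = shift @queue) {
    my ($url, $depth) = @{$item};
    next if $seen{$url}++ or $depth > $max_depth;
    next if $url =~ /\.pdf$/i;                  # crude file-type filter

    my $resp = $ua->get($url);
    next unless $resp->is_success;

    # collect the href of every <a> tag on the page
    my @links;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' and defined $attr{href};
    });
    $extor->parse($resp->content);

    for my $link (@links) {
        my $abs = URI->new_abs($link, $url)->canonical->as_string;
        print "$url\t$abs\n";                   # one edge per line
        push @queue, [ $abs, $depth + 1 ];
    }
}

The wget version is mostly a matter of flags: -i for the seed list,
-r and -l for recursion and depth, -H and -D for which domains you are
willing to cross into, and -R to reject suffixes like pdf.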

Adjacency matrix assembly is quite another problem.  The raw matrix 
produced by such a crawl will very quickly outstrip your ability to either 
store or analyze it, for any sizeable chunk of data collected.
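
The usual workaround is to never build the full N x N matrix at all -
keep the graph sparse, e.g. as a hash of hashes keyed on hostname.  A
quick sketch, assuming tab-separated source/target pairs like the crawl
above spits out:

#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Read "source<TAB>target" URL pairs and aggregate them into a sparse
# host-to-host adjacency structure (hash of hashes) instead of a full
# N x N matrix.
my %edges;
while (my $line = <>) {
    chomp $line;
    my ($src, $dst) = split /\t/, $line;
    next unless $src and $dst;
    my $src_host = eval { URI->new($src)->host } or next;
    my $dst_host = eval { URI->new($dst)->host } or next;
    $edges{$src_host}{$dst_host}++;
}

# Print only the non-zero cells: source host, target host, link count.
for my $src (sort keys %edges) {
    for my $dst (sort keys %{ $edges{$src} }) {
        print join("\t", $src, $dst, $edges{$src}{$dst}), "\n";
    }
}

Aggregating the counts per host rather than per page is usually what
keeps the thing tractable.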

--elijah


> I am trying to analyze the relationships between organizations on the 
> web. In particular, I want to map the linking behavior of a 
> subjectively defined set of organizations.
>
> I would be most grateful if someone could tell me whether there
> exists a web crawler that allows one to define
>   - a set of URLs from which to start the crawl
>   - the depth - how many levels deep one wants to look within a given
> target domain
>   - the number of iterations - how far from the original URL domain
> one wants to go
>   - and a few filters - to limit specific types of pages (pdf, for
> example)
>
> and returns either a map, a table of relationships (some sort of
> adjacency matrix) or both.
>
> Thanks in advance,
>
> Rafel Lucea
> MIT - Sloan School of Management
>


