[Air-l] Mapping the net with crawlers/robots
elijah wright
elw at stderr.org
Wed Oct 20 08:59:11 PDT 2004
Most of this can be done pretty easily with either perl scripts (see the
module WWW::Spyder on CPAN, at
http://search.cpan.org/~ashley/WWW-Spyder-0.18/Spyder.pm) or shell scripts
and command-line flags to wget. It is definitely not a
plug-in-and-go kind of task, however - you are going to have to invest
some serious time to get it working and working *right* for your data
collection needs.
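
Just to give a flavor of what is involved, here is a rough sketch in plain
perl (LWP::Simple plus HTML::LinkExtor - *not* the WWW::Spyder API). The
seed URLs, depth limit, and extension filter are just placeholders you
would have to set for your own collection:

#!/usr/bin/perl
# crawl from a handful of seed URLs to a fixed depth, printing one
# "source<TAB>target" line per link found
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;
use URI;

my @seeds     = ('http://example.org/');   # starting URLs (placeholder)
my $max_depth = 2;                         # how many levels to follow
my %seen;                                  # URLs already fetched

my @queue = map { [ $_, 0 ] } @seeds;      # breadth-first queue of [url, depth]
while (my $item = shift @queue) {
    my ($url, $depth) = @$item;
    next if $seen{$url}++ or $depth > $max_depth;
    next if $url =~ /\.(pdf|ps|doc|zip)$/i;    # crude file-type filter
    # (you could also restrict by host here, to stay within the set of
    # organizations you care about)
    my $html = get($url) or next;

    my @links;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' and $attr{href};
    });
    $p->parse($html);
    $p->eof;

    for my $link (@links) {
        my $abs = URI->new_abs($link, $url);   # make relative links absolute
        print "$url\t$abs\n";
        push @queue, [ "$abs", $depth + 1 ];
    }
}
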
Adjacency matrix assembly is quite another problem. The raw matrix
produced by such a crawl will very quickly outstrip your ability to either
store or analyze it, for any sizeable chunk of data collected.
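
One way to keep that manageable is to never build the dense matrix at all,
and instead aggregate the crawl output into a sparse, weighted
host-to-host edge list. Again just a sketch, reading "source<TAB>target"
lines like the ones printed above:

#!/usr/bin/perl
# collapse page-level links into a sparse host-to-host edge list:
# source_host<TAB>target_host<TAB>count
use strict;
use warnings;
use URI;

my %edge;    # $edge{source_host}{target_host} = number of links seen
while (<>) {
    chomp;
    my ($src, $dst) = split /\t/;
    next unless defined $src and defined $dst;
    my $s = eval { URI->new($src)->host } or next;   # eval: mailto: etc. have no host
    my $d = eval { URI->new($dst)->host } or next;
    $edge{$s}{$d}++;
}

for my $s (sort keys %edge) {
    for my $d (sort keys %{ $edge{$s} }) {
        print "$s\t$d\t$edge{$s}{$d}\n";
    }
}

An edge list like that stays far smaller than the full matrix, and most
network-analysis packages will import it without complaint.
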
--elijah
> I am trying to analyze the relationships between organizations on the
> web. In particular, I want to map the linking behavior of a subjectively
> defined set of organizations.
>
> I would be most grateful if someone could tell me whether there exists
> some web crawler that allows one to define
> - a set of URLs from which to start the crawl
> - the depth - how many levels one wants to look into a given target
> domain
> - the number of iterations - how far from the original URL domain one
> wants to go
> - and a few filters - to exclude specific types of pages (pdf, for example)
>
> and returns either a map, a table of relationships (some sort of
> adjacency matrix) or both.
>
> Thanks in advance,
>
> Rafel Lucea
> MIT - Sloan School of Management
>
> _______________________________________________
> The Air-l-aoir.org at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>
> Join the Association of Internet Researchers:
> http://aoir.org/airjoin.html
>