[Air-l] Software to capture content
James Howison
jhowison at syr.edu
Wed Mar 1 12:35:57 PST 2006
On Mar 1, 2006, at 1:27 PM, Julio Meneses Naranjo wrote:
> Eulàlia,
>
> I was wondering if any of you know about software to
> capture website content – specifically, to capture online
> news outlets (CNN, The Washington Post, The New York
> Times) as well as blog-type news.
> We are about to engage in research involving content
> coding these sites and were wondering if anybody has
> information on costs (any free out there?),
I've been spidering websites for data collection (on open source
software development) and then content analysis for a while.
We use Perl with WWW::Mechanize, gathering pages into a MySQL
database, then parsing out the content we actually need into other
database tables and outputting it in a text format suitable for
coding in ATLAS.ti, HyperRESEARCH, etc. It's far from point-and-click,
but it has worked quite well for our research needs and is of course
free (just add labour!). It's also open source:
http://ossmole.sourceforge.net/ and my colleagues there have Java code
that does similar work. Check the CVS tree; it would require a lot of
customization, but it demonstrates the process.
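For a flavour of the fetch-and-store step, here's a stripped-down
sketch; the database name, credentials, table and URLs are just
placeholders, not our actual schema:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use DBI;

# Placeholder connection details -- substitute your own database.
my $dbh = DBI->connect('DBI:mysql:database=newscrawl;host=localhost',
                       'crawler', 'secret', { RaiseError => 1 });
my $sth = $dbh->prepare(
    'INSERT INTO raw_pages (url, fetched_at, html) VALUES (?, NOW(), ?)');

my $mech = WWW::Mechanize->new( agent => 'research-spider/0.1' );

# Front pages to grab; a real run would read a longer list from a config.
for my $url ('http://www.cnn.com/', 'http://www.washingtonpost.com/') {
    $mech->get($url);
    next unless $mech->success;          # skip pages that failed to load
    $sth->execute($url, $mech->content); # store raw HTML for later parsing
}

$dbh->disconnect;

The idea is that the parsing into separate tables and the export for
coding then run against the stored HTML, so you can re-parse without
hitting the site again.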
If the sites have an RSS or similar feed, that would be a lot easier
to collect and parse for the text content you need than spidering and
saving raw HTML pages.
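If a feed is available, XML::RSS (plus LWP::Simple to fetch it) gets
you straight to titles, links and summaries; the feed URL below is
only an example:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

# Example feed URL -- check each site for its own feeds.
my $feed_url = 'http://rss.cnn.com/rss/cnn_topstories.rss';
my $xml = get($feed_url) or die "could not fetch $feed_url";

my $rss = XML::RSS->new;
$rss->parse($xml);

# Each item already carries a title, link and description,
# so there is no HTML scraping to do.
for my $item (@{ $rss->{items} }) {
    print "$item->{title}\n$item->{link}\n";
    print "$item->{description}\n" if $item->{description};
    print "\n";
}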
> ease of use,
WWW::Mechanize is a programming module, but it has a remarkably
easy-to-use API capable of simulating a browser: clicking links,
filling forms, storing cookies, etc.
http://search.cpan.org/~petdance/WWW-Mechanize-1.18/lib/WWW/Mechanize.pm
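For example (the URL and form field here are made up, just to show
the calls):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;   # cookies are kept automatically
$mech->get('http://www.example.com/');

# "Click" a link by its visible text.
$mech->follow_link( text_regex => qr/World/ );

# Fill in and submit a search form.
$mech->submit_form(
    form_number => 1,
    fields      => { q => 'open source' },
);

print $mech->content if $mech->success;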
This desktop GUI software also looks good, although I haven't tried
it (and it costs $99):
http://www.metafy.com/index.html
"Visually construct Spiders and Scrapers without scripting"
> effectiveness in capturing content, time needed to capture
> content at a point in time, time needed to capture 24-hour
> content, and any other pertinent information that you may
> want to share.
We wrote a paper about the perils and pitfalls of such web mining
that might be of use
(http://floss.syr.edu/publications/howison04msr.pdf), and I have some
materials from a workshop I gave on how to do it (available on
request).
Spiders can capture content very quickly, but beware of two things:
check the robots.txt file for areas the site doesn't want spidered,
and, especially if it is a small server, build a sleep cycle into
your grabs to spare their servers (WWW::Mechanize::Sleepy does this
automatically). It also never hurts to write first asking for access
to the backend database before going to the substantial effort of
spidering, right?
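A rough sketch of both precautions together; the site, agent name and
sleep range are placeholders, and I'm going from memory on
WWW::Mechanize::Sleepy's sleep argument, so check its docs:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;
use WWW::Mechanize::Sleepy;

my $agent = 'research-spider/0.1';      # identify your spider honestly
my $site  = 'http://www.example.com';   # placeholder target

# Honour robots.txt: parse it once, then check every URL before fetching.
my $rules = WWW::RobotRules->new($agent);
$rules->parse("$site/robots.txt", get("$site/robots.txt") || '');

# Pause a random 5 to 20 seconds between requests to spare the server.
my $mech = WWW::Mechanize::Sleepy->new( sleep => '5..20', agent => $agent );

for my $url ("$site/", "$site/archive/") {
    next unless $rules->allowed($url);
    $mech->get($url);
    print "fetched $url\n" if $mech->success;
}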
Cheers,
James
http://james.howison.name
http://floss.syr.edu