[Air-l] Software to capture content

James Howison jhowison at syr.edu
Wed Mar 1 12:35:57 PST 2006





On Mar 1, 2006, at 1:27 PM, Julio Meneses Naranjo wrote:

> Eulàlia,
>
> I was wondering if any of you know about software to
> capture website content – specifically, to capture online
> news outlets (CNN, The Washington Post, The New York
> Times) as well as blog-type news.
> We are about to engage in research involving content
> coding these sites and were wondering if anybody has
> information on costs (any free out there?),

I've been spidering websites for data collection (on open source
software development) and subsequent content analysis for a while.

We use Perl with WWW::Mechanize to gather pages into a MySQL
database, then parse the content we actually need into other
database tables and output it in a text format suitable for coding
in ATLAS.ti, HyperRESEARCH, etc.  It's far from point-and-click,
but it has worked quite well for our research needs and is of
course free (just add labour!).  It's also open source
(http://ossmole.sourceforge.net/), and my colleagues there have
Java code that does similar work.  Check the CVS tree; it would
require a lot of customization, but it demonstrates the process.
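
As a rough illustration (a hedged sketch, not our actual OSSmole
code; the table layout and connection details are placeholders),
the fetch-and-store step looks roughly like this:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use DBI;

# Placeholder connection details -- adjust database, host, user, password.
my $dbh = DBI->connect('DBI:mysql:database=spider;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# Hypothetical table: pages(url VARCHAR, fetched_at DATETIME, html MEDIUMTEXT)
my $insert = $dbh->prepare(
    'INSERT INTO pages (url, fetched_at, html) VALUES (?, NOW(), ?)');

my $mech = WWW::Mechanize->new();

for my $url (@ARGV) {
    $mech->get($url);                        # fetch the page
    $insert->execute($url, $mech->content);  # store raw HTML for later parsing
}

$dbh->disconnect;

Parsing the stored HTML into the tables you actually code from is
the part that varies site by site.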

If the sites have an RSS (or similar) feed, that would be a lot
easier to collect and parse for the text content you need than
spidering and saving raw HTML pages.
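
For example (a hedged sketch; the feed URL is hypothetical), a
module like XML::RSS makes that fairly painless:

use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

# Hypothetical feed URL -- substitute the outlet's real RSS feed.
my $feed = get('http://example.com/rss/topstories.xml')
    or die "Could not fetch feed\n";

my $rss = XML::RSS->new();
$rss->parse($feed);

# Each item already carries the fields you would otherwise have to
# scrape out of the page HTML.
for my $item (@{ $rss->{items} }) {
    print join("\t", $item->{title}, $item->{link},
               $item->{pubDate} || ''), "\n";
}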

> ease of use,

WWW::Mechanize is a programming module, but it has a remarkably
easy-to-use API capable of simulating a browser: clicking links,
filling in forms, storing cookies, etc.

http://search.cpan.org/~petdance/WWW-Mechanize-1.18/lib/WWW/Mechanize.pm
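
As a quick taste of that API (a sketch; the site, link text, and
form names here are made up):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

$mech->get('http://example.com/');     # cookies are handled for you

# Follow a link by its visible text (hypothetical link text).
$mech->follow_link(text => 'World News');

# Fill in and submit a form by name (hypothetical form/field names).
$mech->submit_form(
    form_name => 'search',
    fields    => { q => 'open source' },
);

print $mech->content;                  # HTML of the results page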

This desktop GUI software also looks good, although I haven't
tried it (and it costs $99):

http://www.metafy.com/index.html

"Visually construct Spiders and Scrapers without scripting"

> effectiveness in capturing content, time needed to capture
> content at a point in time, time needed to capture 24-hour
> content, and any other pertinent information that you may
> want to share.

We wrote a paper about the perils and pitfalls in such web-mining
that might be of use
(http://floss.syr.edu/publications/howison04msr.pdf), and have
some materials from a workshop I gave on how to do it (available
on request).

Spiders can capture content very quickly, but beware of two
things.  Check the robots.txt file for areas they don't want
spidered, and, especially if it is a small server, build a sleep
cycle into your grabs to spare their servers
(WWW::Mechanize::Sleepy does this automatically).  Also, it never
hurts to write first asking for access to the backend database
before going to the substantial effort of spidering, right?
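
Something along these lines (a hedged sketch; the user-agent
string, site, and paths are hypothetical) covers both points:

use strict;
use warnings;
use WWW::Mechanize::Sleepy;
use WWW::RobotRules;
use LWP::Simple qw(get);

# Sleep a random 5-20 seconds between requests to spare the server.
my $mech = WWW::Mechanize::Sleepy->new( sleep => '5-20' );

# Check robots.txt before fetching.
my $rules = WWW::RobotRules->new('MyResearchSpider/0.1');
my $robots_url = 'http://example.com/robots.txt';
$rules->parse($robots_url, get($robots_url) || '');

my $page = 'http://example.com/archive/2006/03/01/';
if ($rules->allowed($page)) {
    $mech->get($page);
    print length($mech->content), " bytes fetched\n";
}
else {
    warn "robots.txt asks us not to fetch $page\n";
}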

Cheers,
James

http://james.howison.name
http://floss.syr.edu





