[Air-l] SW to store webpages
Steve Schneider
steve at sunyit.edu
Mon Jun 6 13:34:00 PDT 2005
On 6/5/05 1:36 PM, "Jeremy Hunsinger" <jhuns at vt.edu> wrote:
> I use wget or mirror.pl and then there are the tools represented at
> http://webarchivist.org/ too
> On Jun 5, 2005, at 10:10 AM, elijah wright wrote:
>
>>
>>
>>> I am a PhD student at the University of Reading, UK. I am running
>>> a study on 200 protest group websites. Would you suggest any good
>>> SW to store whole websites offline?
The tools represented at http://webarchivist.org are for somewhat more
elaborate research approaches than many individual scholars are interested
in developing -- but let me try to explain our thinking on this topic.
WebArchivist was created to solve the problem of making regular periodic
copies of a number of sites or pages; retrieving the archived objects by URL
and date; indexing, cataloguing and/or analyzing the sites/pages; and then
retrieving the archived objects on the basis of researcher- or
cataloguer-created metadata (i.e., the index, catalog, or analysis fields).
Our tools seem to be most efficient when the number of objects is relatively
large (dozens to hundreds or even thousands of sites), the capture schedule
is regular (daily, weekly, or monthly), and the effort is sustained (three
months to a few years).
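To make that retrieval model concrete, here is a very rough sketch in Python
of the kind of record and lookup it implies. The field names are my own
invention for illustration, not the actual WebArchivist schema:

    from datetime import date

    # Hypothetical catalog records; the field names are illustrative only.
    archive = [
        {"url": "http://example.org/", "captured": date(2002, 10, 7),
         "site_type": "candidate", "has_issue_statements": True},
        {"url": "http://example.org/", "captured": date(2002, 10, 14),
         "site_type": "candidate", "has_issue_statements": False},
    ]

    def by_url_and_date(url, captured):
        """Retrieve archived objects by URL and capture date."""
        return [r for r in archive
                if r["url"] == url and r["captured"] == captured]

    def by_metadata(**criteria):
        """Retrieve archived objects by cataloguer-created metadata fields."""
        return [r for r in archive
                if all(r.get(key) == value for key, value in criteria.items())]

    # e.g. by_metadata(site_type="candidate", has_issue_statements=True)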
Examples of the kinds of archives/collections that can be sustained include
the Web spheres we've analyzed around the 2002 US Election and the
September 11 terrorist attacks; both are presented at
http://www.loc.gov/minerva. Additional scholarly data on the 2002 election
collection is presented at http://politicalweb.info.
We strongly encourage you to work closely with librarians at your
institution to see whether they are willing to store your collection for
future researchers. Alternatively, consider working with us, or perhaps the
Internet Archive, to store your collection and the data about the Web
objects that you collect.
If you are interested in making a collection accessible to other
researchers, even others in your own research group, you will need to
consider how to serve the objects in the collection. If you have any
concerns about preservation, or about representing the data in a form as
close as possible to what you observed, you may wish to consider the
crawlers that do not change the HTML code. Some programs do change it:
Teleport Pro, for example, and wget when used with certain options. While
rewriting HTML code to make links work locally may make your initial
observation easier, subsequent researchers may find your data very difficult
to interpret, and it may be difficult or impossible to house your collection
as an archive.
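To illustrate the difference with wget specifically (the URL and directory
names below are placeholders, and this is a sketch rather than a recommended
crawl configuration), compare a crawl that leaves the fetched HTML as it was
served with one that rewrites links for local browsing:

    import subprocess

    site = "http://example.org/"   # placeholder target

    # Preservation-oriented crawl: recursive fetch, original HTML left as served.
    subprocess.run([
        "wget", "--recursive", "--level=3", "--page-requisites",
        "--wait=1", "--directory-prefix=archive-raw", site,
    ], check=True)

    # Viewing-oriented crawl: --convert-links rewrites hrefs so pages browse
    # nicely offline, but the stored HTML no longer matches what was observed.
    subprocess.run([
        "wget", "--recursive", "--level=3", "--page-requisites",
        "--convert-links", "--directory-prefix=archive-browsable", site,
    ], check=True)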
Most recently, we've been using the Heritrix crawler and saving our data
into ARC files. This creates the additional challenge of reading the ARC
files, however. There are some tools out there that help -- check out
http://www.netarchive.dk/website/sources/index-en.htm.
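For what it's worth, the uncompressed ARC 1.0 layout is simple enough that a
rough record reader can be sketched in a few lines of Python. This is only a
sketch, assuming the plain space-separated record header ending in a length
field and no gzip compression; the tools above do a great deal more:

    def iter_arc_records(path):
        """Yield (header_fields, payload) from an uncompressed ARC 1.0 file.

        Assumes each record starts with a one-line header whose last field is
        the payload length; the first record yielded is the filedesc:// block.
        """
        with open(path, "rb") as f:
            while True:
                line = f.readline()
                if not line:
                    break              # end of file
                if not line.strip():
                    continue           # blank separator between records
                fields = line.strip().split(b" ")
                length = int(fields[-1])    # last header field: payload length
                payload = f.read(length)
                yield fields, payload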
This thread raises interesting issues about our ability as scholars to
create datasets (archives) of Web-based materials. I'd be glad to continue
the discussion if anyone is interested in this.
//steve.
Steven M. Schneider
Associate Professor, SUNYIT: http://www.sunyit.edu/~steve
Co-Director, WebArchivist: http://www.webarchivist.org