[Air-L] WARC File viewer

Steffen Schilke steffen.schilke at gmail.com
Wed Feb 17 23:18:06 PST 2010


Dear *,

thank you for the answers. I think there should be an open implementation of
a (standalone) viewer for WARC files, which would also make it possible to use
a different archiving system to store these files. In addition, it would allow
viewing / browsing single WARC files (i.e., the pages stored in them). I also
see the need to "export" a single page with all its components, e.g., to prove
how a web page looked at a certain point in time (for legal reasons,
historical research, etc.).
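
To illustrate what the core of such a standalone viewer would have to do, here
is a minimal sketch in Python (standard library only) that walks through the
records of an uncompressed WARC file and lists what is stored in it. It assumes
the record layout of the WARC standard (a "WARC/x.y" version line, named header
fields, an empty line, then Content-Length bytes of content). The function
names iter_warc_records and make_record are hypothetical, the sample record is
a toy that omits mandatory fields such as WARC-Record-ID and WARC-Date, and
folded (multi-line) header values are not handled.

```python
import io


def iter_warc_records(stream):
    """Yield (version, headers, payload) for each record in an
    uncompressed WARC byte stream. Sketch only: header folding is ignored."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separating two records
        version = line.strip().decode("utf-8", "replace")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {}
        # Named header fields run until the first empty line.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)
        yield version, headers, payload


def make_record(rec_type, uri, body):
    """Build a simplified toy record for the demo (not a complete WARC record)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


if __name__ == "__main__":
    sample = make_record("response", "http://example.org/", b"<html></html>")
    for version, headers, payload in iter_warc_records(io.BytesIO(sample)):
        print(version, headers.get("WARC-Type"),
              headers.get("WARC-Target-URI"), len(payload), "bytes")
```

For a file on disk the same function could be called with open(path, "rb").
Compressed .warc.gz files could probably be read the same way through
gzip.open(path, "rb"), assuming the usual one-gzip-member-per-record layout.
An "export" of a single page would then amount to selecting the records whose
WARC-Target-URI belongs to that page and writing out their payloads.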

Speaking of Heritrix: I was reading the manual and I have a little problem
understanding how to set up a crawl job. My task would be to archive
only certain pages in a crawl job, i.e., I want to give Heritrix a list of
URLs, each referring to one page, and I want these pages to be collected
including all their components (e.g., PDF files, images, ...). Is there
anybody here who could give me a hint / sample job definition?
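
To make the input side concrete, the URL list could be prepared roughly as in
the following Python sketch. It assumes that the crawler takes its seeds from a
plain-text file with one URL per line; the function names build_seed_list and
write_seed_file and the file name in the usage note are hypothetical. The part
I am actually asking about, i.e., the job definition that restricts the crawl
to each page plus its embedded components, is not covered here.

```python
from urllib.parse import urlsplit


def build_seed_list(urls):
    """Return the absolute http(s) URLs from `urls`, de-duplicated and in
    their original order. Blank lines and '#' comment lines are skipped."""
    seen = set()
    seeds = []
    for raw in urls:
        url = raw.strip()
        if not url or url.startswith("#"):
            continue
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in seen:
            seen.add(url)
            seeds.append(url)
    return seeds


def write_seed_file(urls, path):
    """Write one seed URL per line to `path` (assumed seed-file format)."""
    with open(path, "w", encoding="utf-8") as fh:
        for url in build_seed_list(urls):
            fh.write(url + "\n")
```

For example, write_seed_file(["http://example.org/page1.html"], "seeds.txt")
would produce a one-line seed file for a single page.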

Thank you and Kind regards

sws

On Thu, Feb 18, 2010 at 1:06 AM, Baden Hughes <baden.hughes at gmail.com> wrote:

> WARC is a standard web archiving file format
> (http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml); it's
> an open standard.
>
> Usually you would use a web archiving tool like Wayback Machine or the
> underlying open source software (the Heritrix web crawler to collect
> web content, the NutchWAX indexing engine to provide search services,
> and Wayback to provide the user interfaces), or a service from
> Archive-It (subscription to custom web archiving service -
> www.archive-it.org) to view these files.
>
> I don't know of a specific viewer for WARCs.
>
> Baden
>
>
> On Thu, Feb 18, 2010 at 10:06 AM, Steffen Schilke
> <steffen.schilke at gmail.com> wrote:
> > Dear *,
> >
> > could you kindly recommend me a viewer for WARC files (web page
> > archiving).
> >
> > Kind regards
> >
> >
> > .
> > _______________________________________________
> > The Air-L at listserv.aoir.org mailing list
> > is provided by the Association of Internet Researchers http://aoir.org
> > Subscribe, change options or unsubscribe at:
> http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> >
> > Join the Association of Internet Researchers:
> > http://www.aoir.org/
> >
>


