[Air-L] WARC File viewer
steffen.schilke at gmail.com
Wed Feb 17 23:18:06 PST 2010
Thank you for the answers. I think there should be an open implementation of
a (standalone) viewer for WARC files, which would also make it possible to use
another archiving system to store these files. In addition, it would then be
possible to view / browse single WARC files (pages stored in WARC files). I
also see the need to "export" a single page with all its components, e.g., to
prove how a web page looked at a certain point in time (e.g., for legal
reasons, historical research, etc.).
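As a rough illustration of what the core of such a standalone viewer might
do, here is a minimal Python sketch that walks the records of an uncompressed
WARC file and yields each record's headers and payload. It relies only on the
record layout from the WARC format (a "WARC/x.y" version line, named header
fields, a blank line, then Content-Length bytes of payload). The function
name iter_warc_records and the SAMPLE data are made up for this example, and
the sketch assumes the header is spelled exactly "Content-Length" and ignores
folded header lines, so it is not a complete parser.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; `stream` must be opened in binary mode.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Assumption: the field name is spelled exactly "Content-Length".
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Two tiny hand-written records, used only to exercise the sketch.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/about\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"<p>hi</p>"
    b"\r\n\r\n"
)

if __name__ == "__main__":
    for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

For a real file one would pass open(path, "rb") instead of the in-memory
sample. Compressed .warc.gz files could probably be read the same way through
gzip.open, since Python's gzip module reads concatenated gzip members as one
stream, but I have not verified that against real crawl output.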
Speaking of Heritrix: I was reading the manual and I have a little trouble
understanding how to set up a crawl job. My task is to archive only certain
pages in a crawl job, i.e., I want to give Heritrix a list of URLs, each
referring to a single page, and have those pages collected, including all
components of each page (e.g., PDF files, images, ...). Is there anybody here
who could give me a hint / sample job definition?
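To make the intended scope concrete, here is a rough Python sketch of what I
mean by "a page plus its components". This is not a Heritrix job definition
and uses no Heritrix API; it only shows which URLs I would expect to be
collected for one seed page. The class name ComponentCollector, the list of
tags treated as embedded components, and the ".pdf" rule for linked documents
are all my own assumptions for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ComponentCollector(HTMLParser):
    """Collect URLs of a page's components (hypothetical helper)."""

    # (tag, attribute) pairs treated as embedded components; an assumed,
    # deliberately incomplete list.
    EMBEDS = {("img", "src"), ("script", "src"), ("link", "href"),
              ("iframe", "src"), ("embed", "src")}
    # Linked documents that should count as part of the page (assumption).
    DOCUMENT_EXTENSIONS = (".pdf",)

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.components = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            url = urljoin(self.base_url, value)  # resolve relative references
            if (tag, name) in self.EMBEDS:
                self.components.append(url)
            elif (tag, name) == ("a", "href"):
                # Ordinary links lead to other pages and stay out of scope;
                # only linked documents such as PDFs are kept.
                if urlparse(url).path.lower().endswith(self.DOCUMENT_EXTENSIONS):
                    self.components.append(url)


def page_components(html, base_url):
    """Return the component URLs found in `html`, in document order."""
    collector = ComponentCollector(base_url)
    collector.feed(html)
    return collector.components
```

For each URL in my list, the crawl should fetch the page itself plus whatever
page_components would return for it, and nothing beyond that.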
Thank you and kind regards
On Thu, Feb 18, 2010 at 1:06 AM, Baden Hughes <baden.hughes at gmail.com> wrote:
> WARC is a standard web archiving file format
> (http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml); it's
> an open standard.
> Usually you would use a web archiving tool like Wayback Machine or the
> underlying open source software (the Heritrix web crawler to collect
> web content, the NutchWAX indexing engine to provide search services,
> and Wayback to provide the user interfaces), or a service from
> Archive-IT (subscription to custom web archiving service -
> www.archive-it.org) to view these files.
> I don't know of a specific viewer for WARCs.
> On Thu, Feb 18, 2010 at 10:06 AM, Steffen Schilke
> <steffen.schilke at gmail.com> wrote:
> > Dear *,
> > could you kindly recommend me a viewer for WARC files (web page
> > Kind regards
> > .
> > _______________________________________________
> > The Air-L at listserv.aoir.org mailing list
> > is provided by the Association of Internet Researchers http://aoir.org
> > Subscribe, change options or unsubscribe at:
> > Join the Association of Internet Researchers:
> > http://www.aoir.org/