[Air-l] SW to store webpages

Alex Halavais halavais at gmail.com
Wed Jun 8 14:11:22 PDT 2005


I guess I will weigh in on the liberal collecting side here:

COPYRIGHT

I agree with Dan's post that fair use is about as far from clear-cut
as possible, and it strikes me that there is potential harm in being
over-cautious on this. I guess I am living (ever so slightly)
dangerously when I make private copies of websites as part of my
research, but I also think it is necessary, fair, and just. I would
hate to see us as a community give up our rights to make use of
copyrighted material. Our research does (or at least ought to) serve
the public welfare, and we should assert our rights to use copyrighted
works appropriately. By shying away from this, I think we adversely
affect the future enforcement of copyright.

It is easy enough for publishers to limit access to materials on the
web by use of passwords and the like, and as soon as they do, the
material is clearly no longer publicly published, and that's a whole
different situation. But particularly for sites that are deliberately put into
the public eye, I think we have a responsibility as researchers to
access and archive that material as our research demands.

ROBOTS.TXT / TOS

There needs to be some balance here in terms of coverage. While I may
be in the minority, I don't think that it is vital that robots.txt
*always* be followed. (For practical reasons, especially with dynamic
sites, it may be a good idea, but I don't think it is an absolute.) If
my robot behaves in such a way that it is indistinguishable from a
gaggle of humans loading stuff on their browsers and saving, then I
see no reason I shouldn't be able to use a robot.

Most robots.txt prohibitions exist because web authors are looking for
a way to shape the search engines' report of their site. They have not
predicted the use of the content by researchers.
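
To make that concrete, here is a minimal sketch of how such rules look
to a crawler, using Python's standard urllib.robotparser. The robots.txt
text, the agent names "ExampleSearchBot" and "ResearchArchiver", and the
example.org URLs are all invented for illustration; none of them come
from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the author is mainly shaping what one search
# engine ("ExampleSearchBot", an invented name) reports about the site.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /drafts/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these made-up rules, allowed("ExampleSearchBot",
"http://example.org/drafts/a.html") is False, while the same URL is
permitted for "ResearchArchiver", which only falls under the catch-all
entry. The rule was written with a search engine in mind and says
nothing in particular about a research crawler.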

On the other hand, some robots (built-in IE engine mentioned above,
iirc, as well as Acrobat) seem to behave not at all like a human,
delivering a huge number of simultaneous or sequential requests,
without appropriate delays. Here, the harm (or potential harm) is much
clearer.
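
For contrast, a robot that tries to look more like a person browsing
might be sketched roughly as follows: one request at a time, with a
random pause between requests. This is only an illustration under
stated assumptions. The function name fetch_politely and the delay
bounds are hypothetical, and the default delays are guesses rather than
measurements of how people actually browse.

```python
import random
import time
import urllib.request


def fetch_politely(urls, min_delay=2.0, max_delay=8.0,
                   fetch=None, sleep=time.sleep):
    """Fetch urls sequentially, pausing a random interval between requests.

    min_delay and max_delay are in seconds and are illustrative guesses.
    fetch and sleep can be swapped out, e.g. for testing without a network.
    """
    if fetch is None:
        def fetch(url):
            # Default: a plain HTTP GET using the standard library.
            with urllib.request.urlopen(url) as response:
                return response.read()

    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first, so the server
            # never sees a burst of back-to-back hits.
            sleep(random.uniform(min_delay, max_delay))
        pages[url] = fetch(url)
    return pages
```

Because requests are never issued in parallel and each one is spaced
out, the load on the server stays closer to that of a single reader
than to the bursts described above.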


