[Air-l] Re: Air-l digest, Vol 1 #96 - 11 msgs

Lee Giles giles at ist.psu.edu
Thu Aug 30 10:06:22 PDT 2001


Hi:

The papers in Science and Nature that describe the work that Dr.
Monaco mentioned can be found at

http://www.ist.psu.edu/faculty_pages/giles/publications/

Dynamically generated pages are hard to count and are usually not
counted. Based on growth patterns, we now estimate
that there are about 4 to 5 billion publicly indexable
web pages. Google seems to have the largest database. Any index generated
by a search engine is usually not counted as a web page. Estimates of the
dark web seem to be based on the capture/recapture method discussed in
our Science paper and are probably underestimates due to the innate
limitations of this approach. Duplicate pages are hard to eliminate, though
much algorithmic work has gone into recognizing them; this is still an open
issue. Furthermore, one can argue that some near duplicates should
be counted.
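
For readers unfamiliar with overlap-based estimation (commonly known as
capture/recapture), here is a minimal sketch of the idea, assuming the
standard two-sample (Lincoln-Petersen) form. The function name and the
figures are hypothetical and are not taken from the Science paper. If two
engines sampled the web independently, the share of engine A's pages that
engine B also holds would approximate B's coverage of the whole web.

```python
def estimate_total_pages(n_a: int, n_b: int, n_both: int) -> float:
    """Two-sample capture/recapture (Lincoln-Petersen) estimate.

    n_a    -- pages found in engine A's index (hypothetical input)
    n_b    -- pages found in engine B's index (hypothetical input)
    n_both -- pages found in both indexes
    """
    if n_both <= 0:
        raise ValueError("overlap must be positive to form an estimate")
    if n_both > min(n_a, n_b):
        raise ValueError("overlap cannot exceed either index size")
    # n_both / n_a approximates the fraction of the whole web that B covers,
    # so the total is roughly n_b / (n_both / n_a).
    return n_a * n_b / n_both


# Made-up numbers, for illustration only.
print(estimate_total_pages(1000, 800, 400))  # prints 2000.0
```

One likely source of the underestimation noted above is that engine
indexes are not independent samples: both tend to favor popular,
well-linked pages, which inflates the overlap and lowers the estimate.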

For an interesting study of how much data/information, both unique
and duplicate, there is in the world and how much is being produced, see:

http://www.sims.berkeley.edu/research/projects/how-much-info/

Best regards,

Lee Giles

air-l-request at aoir.org wrote:

>
>
> Message: 10
> From: "Ellis Godard" <ellisgodard at starband.net>
> To: <air-l at aoir.org>
> Subject: RE: [Air-l] Re: Air-l digest, Vol 1 #94 - 4 msgs
> Date: Tue, 28 Aug 2001 19:26:52 -0700
> Reply-To: air-l at aoir.org
>
> And how does one count dynamically generated pages? Are there as many web
> pages as books available through Amazon? Is google's index counted as only
> two web pages even though almost every instance of the second one is
> different?
>
> -----Original Message-----
> From: air-l-admin at aoir.org [mailto:air-l-admin at aoir.org]On Behalf Of
> monaco
> Sent: Tuesday, August 28, 2001 11:38 AM
> To: air-l at aoir.org
> Subject: [Air-l] Re: Air-l digest, Vol 1 #94 - 4 msgs
>
> number of web pages worldwide
>
> Regarding number of web pages world wide, I was informed that C. Lee Giles
> at Penn State (www.ist.psy.edu/faculty/giles.html) has developed some tools
> to sample and estimate the number of pages.  I also learned of the
> distinction between the dark web (pages inaccessible to crawlers) and the
> pages that are available and referenced via search engines.  It seems that one
> estimate has the dark web at 90% of total pages.
>
> Greg Monaco
>
> Gregory E. Monaco, Ph.D.
> Program Director, Advanced Networking
> National Science Foundation
> 703-292-8948
>
> _--

Dr. C. Lee Giles, David Reese Professor
School of Information Sciences and Technology
and Computer Science and Engineering
The Pennsylvania State University
University Park, PA, 16801, USA
giles at ist.psu.edu - 814 865 4461
http://ist.psu.edu/giles





