[Air-l] Taxonomy of Content on the Internet

richard-seyler.ling at telenor.com
Wed Sep 20 05:28:35 PDT 2006

Hello all,

All of this brings up another issue: how do you sample the web?
What is the universe, and what technique would you use to draw a
generalizable sample?

At one point I thought that if you randomly generated a set of, for
example, four letters such as "SFRW", put them into Google, and then
took the Nth entry in the list, you would get a random sample of web
content.
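
The random-letter idea can be sketched in a few lines of Python. This
is only an illustration: the function names random_query and
random_rank are made up here, the fixed seed is there only to make the
draw repeatable, and drawing N at random (rather than fixing it in
advance) is an assumption. The search itself is left out, since it
depends on whichever search engine is used.

```python
import random
import string


def random_query(length=4, rng=None):
    """Return a string of `length` random uppercase ASCII letters."""
    rng = rng or random.Random()
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def random_rank(max_rank=100, rng=None):
    """Return a random 1-based position in a result list of at most `max_rank` entries."""
    rng = rng or random.Random()
    return rng.randint(1, max_rank)


# Hypothetical usage: draw one query and one result position.
rng = random.Random(2006)  # fixed seed only so the draw can be repeated
query = random_query(4, rng)
rank = random_rank(100, rng)
print(query, rank)
```

Each four-letter string is equally likely here, but that says nothing
about whether the pages it retrieves are a fair sample of the web.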

It turns out that SFRW turns up things like: 

Skype Feature Request Workflow
Santa Fe River Watershed

On reflection, random letters mostly match acronyms for groups and
organizations, so the sample would miss whole categories of content
(for example, porn pages).

Another technique would be to use a random word generator (open the
dictionary to a random page, point at a word, and use it).  Here you
would get a lot of interesting words, but the results would be limited
to the language of the dictionary.
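
A rough Python sketch of the dictionary approach, under some loud
assumptions: random_word is a made-up name, the short word list stands
in for a real dictionary, and picking uniformly from a list is a
swapped-in substitute for opening a book to a random page and pointing.

```python
import random


def random_word(words, rng=None):
    """Pick one word uniformly at random from a list of words."""
    rng = rng or random.Random()
    return rng.choice(words)


# Stand-in word list (an assumption); a real run would load a full dictionary.
english_words = ["river", "workflow", "feature", "watershed"]
print(random_word(english_words, random.Random(2006)))
```

Whatever list is passed in fixes the language of every query, which is
the dependence on the dictionary described above.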

Thus, neither of these approaches would give an unbiased sample.  Does
anybody have any other approaches?

Rich Ling
