All of this brings up another issue, namely, how do you sample the web?
What is the universe, and what technique would you use to derive a
generalizable sample?

At one point I thought that if you randomly generated a set of, for
example, four letters such as "SFRW", put them into Google, and then
took the Nth entry in the list, you would get a random sample of web
content.
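
For anyone who wants to try the random-letter idea, here is a rough
sketch in Python.  The function names (random_letter_query,
pick_result_rank) and the cutoff of 10 results are my own assumptions
for illustration.  The code only builds the query string and picks
which result to take; it does not talk to any search engine.

```python
import random
import string


def random_letter_query(length=4, rng=None):
    """Return a string of `length` random uppercase letters, e.g. "SFRW"."""
    rng = rng or random.Random()
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def pick_result_rank(max_rank=10, rng=None):
    """Pick N, the 1-based position of the search result to sample.

    max_rank is an assumed cutoff; the original idea does not fix one.
    """
    rng = rng or random.Random()
    return rng.randint(1, max_rank)


if __name__ == "__main__":
    query = random_letter_query()
    n = pick_result_rank()
    print(f"Search for {query!r} and take result number {n}")
```

Passing in a seeded random.Random makes a draw repeatable, which would
help if the sample ever had to be documented.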

It turns out that SFRW turns up things like: 

Skype Feature Request Workflow
Santa Fe River Watershed

In thinking about it, the use of random letters would result in a lot of
acronyms for groups (and thus miss, for example, porn pages).

Another technique would be to use a random word generator (open the
dictionary to a random page and point at a word and then use it).  Here
you would get a lot of interesting words, but the search would depend on
the language of the dictionary.
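
The dictionary version can be sketched the same way.  This is only an
illustration: SAMPLE_WORDS is a made-up stand-in for a real dictionary,
and random_dictionary_word is a hypothetical name.  Whatever word list
you plug in fixes the language of the queries, which is exactly the
limitation described above.

```python
import random

# Hypothetical stand-in for "the dictionary"; a real run would load a
# full word list, and that list determines the language of the sample.
SAMPLE_WORDS = ["river", "window", "harvest", "signal", "lantern"]


def random_dictionary_word(words, rng=None):
    """Return one word chosen uniformly at random from `words`."""
    rng = rng or random.Random()
    return rng.choice(words)


if __name__ == "__main__":
    print(random_dictionary_word(SAMPLE_WORDS))
```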

Thus, neither of these approaches would give a generalizable sample.
Does anybody have any other approaches?

Rich Ling

