[Air-l] counting google hits

Thomas Koenig T.Koenig at lboro.ac.uk
Sat Mar 5 15:32:07 PST 2005


Elizabeth,
"Van-Couvering,EJ (pgr)" <E.J.Van-Couvering at lse.ac.uk> writes:

> While it is true that Google doesn't edit contents, from my research, I
> think it is safe to say there is a lot underneath this "efficient
> indexing" of websites.  Each search engine provider (of which there are
> 3-4 major ones - Google, yahoo, Microsoft, and AskJeeves) is striving
> not only to produce the most efficient index of results but the most
> "relevant" index of results.  This is a tricky issue - is a neo-Nazi
> site the most relevant hit for the search "Jew"?

First off, the discussion started with

"folks realize that using the "number of hits returned on google" is a
hilarious bad way to prove a point -- right?"

I took issue with that statement because, in many circumstances, I know of
no better way to test everyday language use. I was thus defending the
*number* of hits, rather than their *rankings*, as a good indicator (there
are exceptions, obviously: pornographic language, for instance, will be
overrepresented).

But I would also vouch, albeit with more caution, for the ranking: if the
ranking did not suit most internet users, Google would not be such a
success. The path dependency argument goes some way toward explaining that
success, but only *some* way: I don't remember which "search engine" I used
first, but I do remember switching from Webcrawler to altavista some time
in March 1997 and from altavista to Google in late 1999, even though I was
already quite content with both Webcrawler and altavista at the time. Thus,
Google appears to churn out the most relevant hits for most internet users,
who, of course, are not a random sample of the global population.

Now, let's have a look at the query "Jew," which, because of its ambiguity,
strikes me as a pretty nonsensical search except for research purposes,
that is, for investigating what is associated with the word "Jew." Well, my
guess is that you will find many anti-Semitic sites, for four reasons
(you'd wish these were only neo-Nazi sites, but anti-Semitism is not at all
confined to the Nazi scene):

1) Anti-Semitism is a large and growing global phenomenon.
2) Anti-Semitic statements on the web are overrepresented.
3) Infamous works such as "The Eternal Jew" and "The International Jew"
contain the word "Jew" in their title. Part of the reason for this choice
of wording is:
4) Anti-Semites will have a propensity to use the word "Jew" in the
singular, as for them Jewishness is an indelible personal attribute, which
defines the character of a person. In contrast, other people will more
often speak of "Jewish" or, maybe sometimes, "Jews". "I'm Jewish" is
certainly the preferred wording over "I'm a Jew."

And, sure enough, for UK-IP queries Google churns out (I hate this
national(ist) bias "feature" of Google, which tries to outguess me as to
which country and which language I would like my search results from) a
revolting anti-Semitic site as the first hit, the Wikipedia entry as the
second, an academic site as the third, a conscious effort to knock the
anti-Semitic site off the top spot as the fourth, and Henry Ford's tractate
as the fifth. All of these seem very relevant hits to me if your question
is "what is associated with the word 'Jew'."

Interestingly enough, Google felt compelled to explain their search
results along pretty much the same lines as I just did:

http://www.google.com/explanation.html

> Or, more commonly,
> when someone searches for "apple" do they mean the fruit or the
> computer?

The homonym problem cannot be solved through search engine diversification.

> It also means, as any search engine optimiser will tell you,
> that there is a lot of very active blacklisting of sites which are
> perceived to be fraudulent.  Therefore I think that a concentrated
> search market is likely to be a bad thing: we may want choice in what
> kind of results we think are relevant.

Well, most users of Google have decided that they would rather not wade
through zillions of bait sites from the porn industry. Why should
machine-generated webpages get the same relevance score as human-generated
sites? I do not believe that the market solves everything, but in this case
the market solution seems to yield the best results.

> Certainly those who live outside
> the major advertising markets are finding that their versions of the
> internet are not particularly well-searched, as commerce drives the
> indexing efforts of all the major engines.

Fair enough, but a proliferation of search engines would likely aggravate
rather than alleviate that problem, as it would mean a diversification
(read: compartmentalization) of the market.

Thomas (not affiliated with Google in any way)
--
thomas koenig, ph.d.
department of social sciences, loughborough university, u.k.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/index.html


