[Air-l] counting google hits

Thomas Koenig T.Koenig at lboro.ac.uk
Wed Mar 2 17:54:41 PST 2005


Quoting elijah wright <elw at stderr.org>:

[What's wrong with using Google stats?]
> because people assume that all texts that are available are represented,
> which according to the google people they are *not*.

Fair enough, but what is your alternative corpus? Most traditional corpora
have a bias away from everyday language to journalistic and/or literary
writings. Sometimes these biases may not matter, and at other times they
might even be desirable, but at times Google is the better choice, even if
imperfect.

> in other words, the sample that you are pulling numbers from is neither
> complete nor perfect - so your results won't be either.

Who gets unbiased random samples? No-one, not even NORC, who are pretty good
at it. Does that invalidate *all* statistical results? Of course not. Don't
get me wrong, I am all for careful random sampling, but if I cannot get it,
I might, under some circumstances, resort to a biased sample rather than
get no sample at all.

> do you understand what google does well enough (details of the algorithm,
> et cetera) to know what the weaknesses are?  oh, you say they haven't
> published enough information for you to know?  that's what i thought.  :|

I do not know how Google indexes (I have a faint idea, though), but for
many practical purposes it simply does not matter, as long as I do not
suspect that websites are excluded in a way that is *systematically
related* to the topic I am researching.

Would I rather have a random sample of all human-generated websites,
preferably with the vital stats of their authors attached? You bet. I just
won't get it. So I am taking the next best thing, aka Google.

> > I am afraid, this is how your argumentation sounds to me. Why should it
> > be wrong to use the number of google hits under all circumstances?
>
> i think your tone is pretty crass.

Funny, that's what I thought of yours, that's why I chose to use *your*
words. You probably know that it's sometimes difficult to discern the tone
when you have no cues other than some ASCII strings.

> > If I want to show that Canada is better known than Vanuatu
> > (http://googlefight.com/index.php?lang=en_GB&word1=canada&word2=vanuatu),
> > why would the comparison of google hits be inadmissible? (There are a
> > number of reasons why the "Vanuatu" hits are inflated, but that is of
> > no concern here).
>
> popularity of a term is one of the few instances in which comparative
> occurrence vis a vis the google corpus *might* be useful.  it would
> depend on your question, and whether the data available from the particular
> google server you're connected to is appropriate to answering it.

Of course, it always depends on what you want to do, but that's a far
cry from your wholesale rejection of using Google hits for any kind of
research:

"folks realize that using the "number of hits returned on google" is a
hilarious bad way to prove a point -- right?"

Thomas

--
thomas koenig, ph.d.
department of social sciences, loughborough university, u.k.
http://www.lboro.ac.uk/research/mmethods/staff/thomas/index.html


