[Air-l] counting google hits

elijah wright elw at stderr.org
Wed Mar 2 17:21:28 PST 2005


>> folks realize that using the "number of hits returned on google" is a 
>> hilariously bad way to prove a point -- right?
>
> Wrong. What's wrong with using the vast internet resources as a 
> quasi-corpus for natural languages (if you avoid certain pitfalls, which 
> I alluded to in my last message)?

because people assume that every available text is represented in the 
index, which, according to the google people, is *not* the case.

in other words, the sample that you are pulling numbers from is neither 
complete nor perfect - so your results won't be either.

do you understand what google does well enough (details of the algorithm, 
et cetera) to know what the weaknesses are?  oh, you say they haven't 
published enough information for you to know?  that's what i thought.  :|


> I am afraid this is how your argumentation sounds to me. Why should it 
> be wrong to use the number of google hits under all circumstances?

i think your tone is pretty crass.


> If I want to show that Canada is better known than Vanuatu 
> (http://googlefight.com/index.php?lang=en_GB&word1=canada&word2=vanuatu), 
> why would the comparison of google hits be inadmissible? (There are a 
> number of reasons why the "Vanuatu" hits are inflated, but that is of 
> no concern here).

popularity of a term is one of the few instances in which comparative 
occurrence vis a vis the google corpus *might* be useful.  it would depend 
on your question, and whether the data available from the particular 
google server you're connected to is appropriate to answering it.


--elijah


