[Air-L] Text Sample Size?

James Howison james at howison.name
Tue Aug 18 09:50:16 PDT 2009

Lots of useful responses so far. Just wanted to add that we've dealt  
with a similar question in attempting to move from qualitative human  
coding to natural language processing.  It was useful for us to think  
about the relationship between the phenomenon of interest and the  
units of analysis.

i.e., do you have good theoretical (or prior empirical) reasons to  
believe that the differences between men and women that you are  
interested in vary with the number of words?  If you are talking about  
individual word choice then the number of words that you sample ought  
to be relevant.  If, though, the phenomenon that you are interested in  
is at a different level of analysis, say paragraph level or post  
level, then your sampling reasoning should match that.  Perhaps the  
differences are in post openings or closings?

Once you nail the unit of analysis question, you need to ask what  
you know about the population distribution of the phenomenon you are  
interested in.  For example, if it shows up only once in every (approx)  
1,000 words, then you'll need to sample enough 1,000-word units to  
ensure that you have enough possible places where it might have shown  
up (i.e. something like 300 x 1,000 words) for the inferential logic  
to work.
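
To make that arithmetic concrete, here is a rough Python sketch.  It  
assumes occurrences are independent and follow a Poisson distribution,  
which is my simplification and may not hold for real blog text.  The  
function names and the 1-in-1,000 rate are illustrative only.

```python
import math

def expected_occurrences(rate_per_word, words_sampled):
    """Expected number of times the phenomenon appears in the sample."""
    return rate_per_word * words_sampled

def prob_at_least(k, rate_per_word, words_sampled):
    """P(at least k occurrences), assuming a Poisson model (an assumption)."""
    mean = rate_per_word * words_sampled
    # P(X < k) is the sum of the Poisson probabilities for 0 .. k-1
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below

rate = 1 / 1000  # hypothetical: one occurrence per 1,000 words
for words in (1_000, 50_000, 300_000):
    print(words, expected_occurrences(rate, words), prob_at_least(30, rate, words))
```

On those assumptions a single 1,000-word unit will very rarely give  
you 30 or more occurrences, while a few hundred such units give you  
a few hundred expected occurrences per group to compare.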

It's also possible that you don't yet know the patterns of difference;  
those might be what you are seeking to discover, although that would  
seem to call for a qualitative phase.  In that case a logic of  
sufficiency (i.e. I've now seen enough examples, and I'm not seeing any  
new types, usually called "exhaustion", in reference to concepts, not  
the coder!) might help you determine when to stop coding.  Of course  
such a strategy means that the claims you can make are different (i.e.  
this is a theory-generating, not a theory-testing, methodology).  Once  
that process is done you'll have a better idea of the likely  
population distribution of your phenomenon, which will then give you  
insight into what sample size you'd need to test your theory.
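
One way to operationalise that stopping logic is sketched below in  
Python.  The rule "stop after `patience` consecutive units with no new  
codes" is my own assumption for illustration, not an established  
standard, and the function name and data are hypothetical.

```python
def units_until_exhaustion(coded_units, patience):
    """Return (units_coded, codes_seen) at the point where `patience`
    consecutive units have added no new code, or (None, codes_seen)
    if that point is never reached."""
    seen = set()
    since_new = 0
    for n, codes in enumerate(coded_units, start=1):
        new = set(codes) - seen
        if new:
            seen |= new
            since_new = 0  # a new concept appeared, so reset the counter
        else:
            since_new += 1
        if since_new >= patience:
            return n, seen
    return None, seen

# Hypothetical codes assigned to six posts, in the order they were coded
posts = [["a"], ["b"], ["a"], ["a", "b"], ["b"], ["c"]]
print(units_until_exhaustion(posts, patience=3))
```

Note that in this toy data a new code ("c") turns up right after the  
stopping point, which is exactly the risk of setting the threshold  
too low.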

<credibility information redacted ;>

On 17 Aug 2009, at 8:29 PM, Karyn Hollis wrote:

>   Hi All--
>   This is a newbie question.  I am planning to do a quantitative data
>   analysis to study blogs for gender differences in CMC.  Are there any
>   rules for the size of samples?  Would comparing male to female blog
>   texts of a total of 50,000 words each be enough to claim statistical
>   significance for any differences I find?
>   Thanks for any advice,
>   Karyn Hollis
>   Villanova University