[Air-L] Text Sample Size?

Tue Aug 18 10:50:58 PDT 2009

Excellent thread -

In terms of resources, you might wish to look at the work of Susan  
Herring, especially her content analyses of weblogs.  Additionally,  
papers presented at the SIGIR, TREC (Blog track) and ICWSM  
conferences, and the journals JASIST and IP&M may have useful  
methodology segments.  There are probably lots of other useful venues.

Following on James' excellent comments, I would urge you to think
about this analysis at the observation level, rather than in terms of
overall corpus size.  Let's assume you have 200 chunks of 1000-word
text from a random collection of blogs (100 female-gendered blogs,
100 male-gendered blogs).  You could then have raters apply a
subjective scale to the text, and then you could compare scale
responses between the groups, looking for statistically significant
differences.
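
To tie the chunk count to James' point (quoted below) about a
phenomenon that shows up roughly once per 1,000 words, here is a rough
back-of-the-envelope sketch.  It assumes occurrences follow a Poisson
process at a constant rate, which is my simplifying assumption and not
something established in this thread; the function names are
hypothetical.

```python
import math


def expected_occurrences(rate_per_1000_words, n_units, unit_words=1000):
    """Expected total occurrences across n_units chunks of unit_words words."""
    return rate_per_1000_words * (unit_words / 1000) * n_units


def prob_unit_has_none(rate_per_1000_words, unit_words=1000):
    """P(a single chunk contains zero occurrences), assuming a Poisson model."""
    return math.exp(-rate_per_1000_words * unit_words / 1000)


if __name__ == "__main__":
    rate = 1.0  # about one occurrence per 1,000 words, as in James' example
    for n_units in (50, 200, 300):
        total = expected_occurrences(rate, n_units)
        print(f"{n_units} chunks: about {total:.0f} expected occurrences")
    print(f"share of chunks with no occurrence: {prob_unit_has_none(rate):.2f}")
```

Under that assumption a sizeable share of 1000-word chunks would
contain no occurrence at all, which is one reason to count chunks
(observations) rather than total words.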

With 200 observations, it would usually be reasonable to use
parametric tests such as t-tests or ANOVA, since the group means will
be approximately normally distributed.  However, if you had fewer
observations, nonparametric methods such as the Wilcoxon rank-sum and
Kruskal-Wallis tests would be applicable.  With these tests you're not
drawing general, population-level inferences, but they will allow you
to run comparisons within your data set.
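
As a minimal sketch of the two kinds of comparison, the following uses
only the Python standard library to compute Welch's t statistic and
the Mann-Whitney U statistic (the two-sample Wilcoxon rank-sum test)
with a tie-corrected normal approximation.  The ratings are made-up
placeholder values on an assumed 1-7 scale, not real data, and the
p-values use a normal approximation, so treat this as an illustration
and not a substitute for a statistics package.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def average_ranks(values):
    """1-based ranks where tied values share their average rank.

    Also returns the size of each group of tied values.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def mann_whitney(a, b):
    """Mann-Whitney U for sample a, plus a tie-corrected normal z score.

    Not handled: the degenerate case where every value is identical.
    """
    na, nb = len(a), len(b)
    n = na + nb
    ranks, tie_sizes = average_ranks(list(a) + list(b))
    u = sum(ranks[:na]) - na * (na + 1) / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    sigma = math.sqrt(na * nb / 12 * ((n + 1) - tie_term))
    z = (u - na * nb / 2) / sigma
    return u, z


def two_sided_p_normal(z):
    """Two-sided p-value from a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder ratings (1-7 scale) for 100 chunks per group; NOT real data.
    female = [rng.randint(1, 7) for _ in range(100)]
    male = [rng.randint(1, 7) for _ in range(100)]
    t, df = welch_t(female, male)
    u, z = mann_whitney(female, male)
    print(f"Welch t = {t:.3f} (df ~ {df:.1f}), approx p = {two_sided_p_normal(t):.3f}")
    print(f"Mann-Whitney U = {u:.1f}, z = {z:.3f}, approx p = {two_sided_p_normal(z):.3f}")
```

With ordinal rater scales there will be many tied values, which is why
the rank-based version above shares ranks among ties and adjusts the
variance accordingly.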

If you are looking for population-level statistical significance, this
study lends itself to a stratified, two-stage sampling design.  The
first stage of sampling could be from a public listing of weblogs
(finite population) or from a randomized search (infinite population).
The second stage of sampling would be text chunks of appropriate size.
Depending on the gender distribution, you may need to apply weighting
within your sample.  Importantly, you would be able to calculate
standard errors with this design.
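
As a rough sketch of the weighting and standard-error step, here is
the textbook stratified estimator of a mean, treating gender as the
stratum.  The population shares and scores are hypothetical
placeholders, and the sketch ignores both the finite population
correction and the clustering of chunks within blogs, so the standard
error of a real two-stage sample would generally be larger than this.

```python
import math
import statistics


def stratified_mean(samples, population_shares):
    """Weighted mean and standard error for a stratified sample.

    samples: dict mapping stratum name -> list of observed scores
    population_shares: dict mapping stratum name -> share of the
        population in that stratum (assumed known; should sum to 1)
    """
    estimate = 0.0
    variance = 0.0
    for stratum, scores in samples.items():
        weight = population_shares[stratum]
        estimate += weight * statistics.mean(scores)
        # Each stratum contributes W_h^2 * s_h^2 / n_h to the variance.
        variance += weight ** 2 * statistics.variance(scores) / len(scores)
    return estimate, math.sqrt(variance)


if __name__ == "__main__":
    # Placeholder scores and shares, NOT real data.
    samples = {"female": [4, 5, 3, 6, 5], "male": [3, 4, 2, 5, 3]}
    shares = {"female": 0.45, "male": 0.55}
    est, se = stratified_mean(samples, shares)
    print(f"weighted mean = {est:.2f}, standard error = {se:.2f}")
```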

Best,
Fred
(Also of limited credibility)

On Aug 18, 2009, at 12:50 PM, James Howison wrote:

> Lots of useful responses so far. Just wanted to add that we've dealt  
> with a similar question in attempting to move from qualitative human  
> coding to natural language processing.  It was useful for us to  
> think about the relationship between the phenomenon of interest and  
> the units of analysis.
>
> That is, do you have good theoretical (or prior empirical) reasons to
> believe that the differences between men and women that you are
> interested in vary with the number of words?  If you are talking
> about individual word choice then the number of words that you
> sample ought to be relevant.  If, though, the phenomenon that you are
> interested in is at a different level of analysis, say paragraph
> level or post level, then your sampling reasoning should match that.
> Perhaps the differences are in post openings or closings?
>
> Once you nail the unit of analysis question then you need to ask
> what you know about the population distribution of the phenomenon you
> are interested in.  For example, if it shows up only once in every
> (approx) 1,000 words, then you'll need to sample enough 1,000-word
> units to ensure that you have enough possible places that it might
> have shown up (i.e., something like 300 x 1000) for the inferential
> logic to work.
>
> It's also possible that you don't yet know the patterns of
> difference; those might be what you are seeking to discover,
> although that would seem to call for a qualitative phase.  In that
> case a logic of sufficiency (i.e., I've now seen enough examples, and
> I'm not seeing any new types, usually called "exhaustion", in
> reference to concepts, not the coder!) might help you determine when
> to stop coding.  Of course such a strategy means that the claims you
> can make are different (i.e., this is a theory-generating, not a
> theory-testing, methodology).  Once that process is done you'll have a
> better idea of the likely population distribution of your phenomena,
> which will then give you insight into what sample size you'd need to
> test your theory.
>
> Cheers,
> James
> <credibility information redacted ;>
>
> On 17 Aug 2009, at 8:29 PM, Karyn Hollis wrote:
>
>>
>>  Hi All--
>>  This is a newbie question.  I am planning to do a quantitative data
>>  analysis to study blogs for gender differences in CMC.  Are there any
>>  rules for the size of samples?  Would comparing male to female blog
>>  texts of a total of 50,000 words each be enough to claim statistical
>>  significance for any differences I find?
>>  Thanks for any advice,
>>  Karyn Hollis
>>  Villanova University
>> _______________________________________________
>> The Air-L at listserv.aoir.org mailing list
>> is provided by the Association of Internet Researchers http://aoir.org
>> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>>
>> Join the Association of Internet Researchers:
>> http://www.aoir.org/
>

--
Fred Stutzman
Ph.D. Student and Teaching Fellow
School of Information and Library Science, UNC-Chapel Hill
fred at fredstutzman.com | (919) 260-8508 | http://fredstutzman.com/