[Air-L] Text Sample Size?
james at howison.name
Tue Aug 18 09:50:16 PDT 2009
Lots of useful responses so far. Just wanted to add that we've dealt
with a similar question in attempting to move from qualitative human
coding to natural language processing. It was useful for us to think
about the relationship between the phenomenon of interest and the
units of analysis.
That is, do you have good theoretical (or prior empirical) reasons to
believe that the differences between men and women that you are
interested in vary with the number of words? If you are talking about
individual word choice then the number of words that you sample ought
to be relevant. If, though, the phenomenon that you are interested in
is at a different level of analysis, say paragraph level or post
level, then your sampling reasoning should match that. Perhaps the
differences are in post openings or closings?
Once you nail the unit-of-analysis question, you then need to ask what
you know about the population distribution of the phenomenon you are
interested in. For example, if it shows up only once in every (approx.)
1,000 words, then you'll need to sample enough 1,000-word units to
ensure that you have enough possible places that it might have shown
up (i.e., something like 300 x 1,000 words) for the inferential logic to work.
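To make the arithmetic in that example concrete, here is a rough sketch in Python. The function names are made up for illustration, and the figure of 300 is just the ballpark number from the paragraph above, not a rule. It assumes the phenomenon occurs at a roughly constant rate across the text.

```python
def words_needed(words_per_occurrence, target_occurrences):
    """Words to sample so the expected number of occurrences reaches the target.

    Assumes (hypothetically) a constant rate of one occurrence per
    `words_per_occurrence` words.
    """
    return words_per_occurrence * target_occurrences


def expected_occurrences(words_per_occurrence, n_words):
    """Expected number of occurrences in a sample of `n_words` words."""
    return n_words / words_per_occurrence


# Assumed values taken from the example in the text.
print(words_needed(1000, 300))             # 300000
print(expected_occurrences(1000, 50_000))  # 50.0
```

Under those assumptions, a 50,000-word sample would be expected to contain only about 50 occurrences of something that shows up once per 1,000 words. Whether that is enough depends on the test and the size of the difference you expect.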
It's also possible that you don't yet know the patterns of difference;
those might be what you are seeking to discover, although that would
seem to call for a qualitative phase. In that case a logic of
sufficiency (i.e., I've now seen enough examples, and I'm not seeing any
new types, usually called "exhaustion", in reference to concepts, not
the coder!) might help you determine when to stop coding. Of course
such a strategy means that the claims you can make are different (i.e.,
this is a theory-generating, not a theory-testing, methodology). Once
that process is done you'll have a better idea of the likely
population distribution of your phenomena, which will then give you
insight into what sample size you'd need to test your theory.
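The "stop when no new types appear" logic can be sketched as a simple stopping rule. This is only an illustration, not an established procedure. The function name, the `patience` threshold, and the example codes are all hypothetical, and it simplifies by assuming each coded item gets exactly one code.

```python
def coded_until_saturation(codes, patience):
    """Return how many items had been coded when `patience` consecutive
    items produced no new code type, or None if that never happened.

    `codes` is the sequence of codes assigned by the human coder, in the
    order the items were coded (one code per item, by assumption).
    """
    seen = set()
    since_new = 0  # consecutive items that added no new code type
    for count, code in enumerate(codes, start=1):
        if code in seen:
            since_new += 1
        else:
            seen.add(code)
            since_new = 0
        if since_new >= patience:
            return count
    return None


# Made-up codes, purely for illustration.
example = ["hedge", "greeting", "hedge", "emoticon",
           "greeting", "hedge", "hedge", "greeting"]
print(coded_until_saturation(example, patience=4))  # 8
```

The choice of `patience` is a judgment call. A small value stops early and risks missing rare types, while a large value costs more coding time.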
<credibility information redacted ;>
On 17 Aug 2009, at 8:29 PM, Karyn Hollis wrote:
> Hi All--
> This is a newbie question. I am planning to do a quantitative data
> analysis to study blogs for gender differences in CMC. Are there
> rules for the size of samples? Would comparing male to female blog
> texts of a total of 50,000 words each be enough to claim statistical
> significance for any differences I find?
> Thanks for any advice,
> Karyn Hollis
> Villanova University