[Air-l] Question re Size of Data Set...

Andy Williamson andy at wairua.co.nz
Thu Jan 25 14:13:29 PST 2007

Hi Matthew

Doesn't this seem like a 'how long is a piece of string' question! However,
to a great extent, the answer lies in your methodology. You say you're using
a "ground-theory approach"; I presume you mean grounded theory (darn that
keyboard)... In which case the key issue here is that you are looking for
saturation.

It is impossible to predict in advance when this will occur, and so the
correct answer to your question is 'don't know' - you need to begin your
analysis. I think anyone who's done GTM kind of gets a gut feel for whether
what they have is enough once they start looking at and becoming immersed in
the data. 

You've identified a sub-set of the message board postings (that's a good
start as it's at least more manageable) and some categories, processes and
attributes will emerge from your analysis. If you find these becoming saturated
with the data you have, you're (probably) doing ok. If you get to the end
and you're still discovering new stuff, then I'd suggest you might want to
go back and extend the data set.
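
One rough way to keep an eye on this while coding is to log the codes applied
to each batch of posts and count how many are new. The sketch below is purely
illustrative: the code labels and batches are made up, and the stopping rule
(two consecutive batches with no new codes) is an assumption for the example,
not something GTM prescribes.

```python
def new_codes_per_batch(batches):
    """For each batch, count the codes that did not appear in any earlier batch."""
    seen = set()
    counts = []
    for batch in batches:
        fresh = set(batch) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def looks_saturated(counts, quiet_batches=2):
    """Assumed rule: the last `quiet_batches` batches introduced no new codes."""
    if len(counts) < quiet_batches:
        return False
    return all(c == 0 for c in counts[-quiet_batches:])


# Hypothetical code labels, one list per coded batch of posts.
batches = [
    ["newbie-correction", "in-joke"],
    ["in-joke", "moderator-reminder"],
    ["newbie-correction"],
    ["in-joke", "moderator-reminder"],
]
counts = new_codes_per_batch(batches)
print(counts, looks_saturated(counts))
```

A count that keeps dropping to zero is only a prompt to stop and reflect; the
judgement about whether the categories are really saturated stays with the
researcher.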

As for good sources, your issue here is methodology, less so the internet
(but it is interesting to see how GTM is used in similar studies) - I would
suggest you get clear about GTM, which version you are using and why, and
this will help you no end. One of the best discipline areas for writings on
grounded theory is actually nursing research (I kid you not!).

Good luck

-----Original Message-----
From: air-l-bounces at listserv.aoir.org
[mailto:air-l-bounces at listserv.aoir.org] On Behalf Of Matthew Pearson
Sent: Thursday, January 25 2007 12:32
To: air-l at listserv.aoir.org
Subject: [Air-l] Question re Size of Data Set...

Hello All:

This is my debut post to this excellent list after a long time spent reading
lots of good stuff from others.

I have a question re the size of my data set for my dissertation
project: Is my data set too large, too small, or just right?

I'd very much appreciate any insight/ideas/feedback anyone has about this.
My advisor and I aren't that sure about this issue, and I haven't been able
to discern much about this issue from a lot of the studies I've read.

I do realize my question is thus far meaningless without knowing anything
about my project, so here's some more information/background:

I'm doing a close look at one message board community--one devoted to
discussion of a particular college basketball team.  I've got all sorts of
things I'm interested in, but my central research question has to do with
the ways that people teach each other and learn from one another the
conventions for discourse in/on the message board.
(I'm also interested in potential emerging genres of writing, the influence
of sports fandom on online literacy practices, and perhaps even examining
issues related to gender (which I realize is a pretty general thing to say,
but I'll keep it at that for now).)

I've got two main sources of data: both (1) archived threads/posts from the
message board, and (2) online questionnaires that participants/members
filled out.  My question concerns source (1)-- the archival data.

I have tons of data archived.  I used one of those "site-sucker" programs
to grab all the discussions on the message board over about an 8-month
period of time.  Given that this message board is a pretty busy one
and that I'm using a ground-theory approach to the data analysis, I chose to
sample a smaller set of the overall data.  I used an "event sampling" method
and, with input from posters on the message board, chose 5 "big" events
around which to sample discussion.  I then also chose 5 other events that
occurred during the months I archived discussion that were not listed as
"big" events by anyone who offered their sense of the "big" events.  I
didn't, though, choose just those threads of discussion related to those
"big" and non-big events, but rather used those as anchoring moments in
time, and then sampled ALL discussion that occurred on those dates, and one
day prior and one day later.  This resulted in such a large data set that I
ended up using only 3 "big" events and 3 non-big ones, and then sampling
for those dates, and the days immediately around them.
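
The date-window sampling described above can be sketched as follows. This is
only an illustration under stated assumptions: the six anchor dates are
hypothetical placeholders (the actual event dates are not given), and
`sampling_window` is a made-up helper name.

```python
from datetime import date, timedelta


def sampling_window(anchor_dates, days_either_side=1):
    """Return the sorted, de-duplicated days covering each anchor date
    plus `days_either_side` days before and after it."""
    days = set()
    for anchor in anchor_dates:
        for offset in range(-days_either_side, days_either_side + 1):
            days.add(anchor + timedelta(days=offset))
    return sorted(days)


# Hypothetical anchor dates: 3 "big" events and 3 non-big ones.
anchors = [
    date(2006, 3, 4), date(2006, 3, 18), date(2006, 4, 1),
    date(2006, 5, 10), date(2006, 6, 15), date(2006, 7, 20),
]
window = sampling_window(anchors)
print(len(window))  # 6 anchors x 3 days each = 18 days when no windows overlap
```

Using a set means that two events falling on adjacent days would share days
rather than double-count them, so the total can be less than three days per
event.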

What I'm left with now is about 4000 individual .html pages, some of which
have fairly detailed threads of discussion, with sizable individual posts,
and also, of course, many that have cursory, short sentences that
perhaps look more like "chat."  This is a lot of stuff to wade through, yet
it does represent only 18 days of life on this message board.  Thus far I've
been going through the data in separate "passes," looking for answers to
particular aspects of my research question, and it's a daunting thing.  I
know research takes a lot of work and time, but I thought it wise to get
feedback to see if I'm going overboard here.

So does my sample sound reasonable?  I'm well aware that the way I sample
will directly impact the kinds of conclusions I can draw and the level of rigor
folks see in my work.

Any thoughts?  Good sources re this kind of methodology?  I've got Virtual
Methods, ed. by Hine, among other sources, and haven't seen anything yet re
sample size.  Maybe I missed it somehow?

many thanks,

Matthew Pearson
mdpearson at wisc.edu
PhD Candidate, University of Wisconsin Department of English-- Composition
and Rhetoric; Research Assistant, UC-Irvine Writing Project; & Man on the

The air-l at listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org