[Air-L] A question for researchers interested in the basics of statistical inference

Fred Stutzman fred at fredstutzman.com
Thu Sep 3 09:02:52 PDT 2009


Hi Monica,

Congrats on your thesis.  I will take a stab at your questions.

I think what might be problematic is the conception of inference.  In
inferential statistics, the base definition of inference is drawing
conclusions about a larger population from a sampled data set.  In
these cases, the gold-standard sampling method is simple random
sampling with replacement (SRSWR), though population-inferential
statistics are commonly computed on SRS samples, cluster samples,
multi-stage samples, and so on.
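
To make the sampling terms concrete, here is a minimal sketch in Python
of the difference between drawing with replacement (SRSWR) and without
replacement (SRS).  The population of 1,000 unit IDs, the sample size
of 100, and the function names are assumptions made up for
illustration; they have nothing to do with your survey.

```python
import random


def srswr(population, n, rng):
    """Simple random sample WITH replacement: a unit may be drawn more than once."""
    return [rng.choice(population) for _ in range(n)]


def srs(population, n, rng):
    """Simple random sample WITHOUT replacement: each unit appears at most once."""
    return rng.sample(population, n)


rng = random.Random(2009)          # fixed seed so the sketch is repeatable
population = list(range(1, 1001))  # hypothetical sampling frame of 1,000 unit IDs

with_replacement = srswr(population, 100, rng)
without_replacement = srs(population, 100, rng)

print("distinct units, with replacement:   ", len(set(with_replacement)))
print("distinct units, without replacement:", len(set(without_replacement)))
```

Both draws need a sampling frame (the list of units) to exist in the
first place, which is exactly what is hard to obtain for hidden
populations.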

To produce unbiased estimates, inferential statistical methods have a
set of assumptions.  OLS, for example, has a number of assumptions:
the IVs' variation is not random, there is no multicollinearity, the
errors are homoskedastic, and the mean of the residuals is zero.  Now,
many of these assumptions are met when the sample that produced the
data was a probability sample.  If the estimates are unbiased, you can
calculate variance, standard errors and confidence intervals for the
population.
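
As a rough illustration of that last step, the sketch below computes a
sample mean, its standard error, and an approximate 95% confidence
interval for the population mean.  It assumes a simple random sample
and a normal approximation (z = 1.96); the simulated population and the
function name are hypothetical.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (mean, standard error, (low, high)) using a normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, se, (mean - z * se, mean + z * se)


rng = random.Random(42)
# Hypothetical population of 10,000 scores (assumed, for illustration only)
population = [rng.gauss(50, 10) for _ in range(10_000)]
# Probability sample: simple random sample without replacement
sample = rng.sample(population, 400)

mean, se, (low, high) = mean_confidence_interval(sample)
print(f"mean={mean:.2f}  se={se:.2f}  95% CI=({low:.2f}, {high:.2f})")
```

The interval is only interpretable as a statement about the population
because the sample was drawn at random from it.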

Importantly, a sample does not require a random draw for valid
inferential estimation.  If a purposive sample can meet the
assumptions of an inferential model, you can certainly produce
unbiased estimates.  However, gauging the degree of bias in a
purposive sample is difficult, so it is unwise to assume the estimates
are truly unbiased.  Let me focus on your question regarding the
propriety of using inferential techniques on purposive samples.

A different and complementary use of inferential statistics is to
draw inferences about relations in data, for example to test
differences between groups or the relations between many variables in
an analysis.  In this case, the inference is not population-level;
rather, it describes the relations in the data at hand.  In
these cases, we cannot argue that our estimates are representative and
unbiased, but many models are robust enough, and have enough
diagnostic features, that we can generally gauge the validity of the
measures.  In these cases, if we realize and report the limitations of
the models, it is appropriate to use them.
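
One way to see what a within-sample inference about a group difference
can look like is a permutation test, sketched below.  To be clear, this
is a technique I am swapping in for illustration, not something the
discussion above depends on.  It asks only whether the observed
difference is larger than what random relabelling of the cases at hand
would produce.  The two groups of scores and the function name are
made up.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical scores for two groups inside one purposive sample
group_a = [12, 15, 14, 10, 13, 17, 16, 11]
group_b = [9, 8, 12, 7, 10, 11, 6, 9]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference={observed:.2f}  permutation p={p_value:.4f}")
```

The p-value here says nothing about a wider population; it describes
the relation within these cases only.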

Now, the second part of your question dealt with parametric and
non-parametric methods.  In statistics, "parametric" describes methods
that assume the population follows a distribution defined by a small
set of parameters.  In most cases, we are concerned with the normal
distribution.  A non-parametric method does not assume that the data
fit a particular distribution.  Often this matters in cases where our
sample is quite small and the distribution cannot be checked; in that
case a nonparametric method would apply.  However, as samples grow
larger, statistics such as the mean tend to follow known distributions,
and distribution-appropriate methods would apply.
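
One common reading of the point about size is the central limit
theorem: means of larger samples look more normal even when the
underlying variable is skewed.  The sketch below is a small simulation
under assumed conditions (an exponential population, 2,000 replications
per sample size); the function names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the mean of cubed standardized deviations."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


def sample_mean_skewness(n, replications=2000, seed=1):
    """Skewness of the distribution of sample means for samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replications)
    ]
    return skewness(means)


# Skewness near 0 indicates a roughly symmetric, more normal-looking shape
for n in (2, 10, 50):
    print(f"n={n:>2}  skewness of sample means={sample_mean_skewness(n):.2f}")
```

With very small samples the skew of the raw variable carries through to
the means, which is one reason a distribution-free method can be the
safer choice there.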

The application of inferential methods to non-probability samples is
appropriate if the inferences are to be drawn within the sample, and
the characteristics of the distributions reasonably meet the criteria
of the method.  You should generally be careful when making claims
beyond the sample (about its representativeness) or about the degree
of unbiasedness, but you can use these techniques to make inferential
estimates regarding the data at hand.

Finally, with regard to confidence levels, in a between-means
comparison such as a t-test, we are comparing the hypothetical
distributions of the groups, and the significance test provides our
intervals for comparison.
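
For a between-means comparison, a minimal sketch of the arithmetic is
below: the difference in means, its standard error, a Welch-style t
statistic, and an approximate 95% interval for the difference.  The
interval uses z = 1.96 as a normal approximation, which is an
assumption; with groups this small a t critical value would be wider.
The data and function name are made up.

```python
import math
import statistics


def welch_comparison(group_a, group_b, z=1.96):
    """Return (mean difference, t statistic, approximate 95% interval)."""
    mean_diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    # Standard error of the difference, allowing unequal group variances
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    t_stat = mean_diff / se
    # Normal-approximation interval (assumed; not an exact t interval)
    interval = (mean_diff - z * se, mean_diff + z * se)
    return mean_diff, t_stat, interval


# Hypothetical scores for two groups
group_a = [12, 15, 14, 10, 13, 17, 16, 11]
group_b = [9, 8, 12, 7, 10, 11, 6, 9]

mean_diff, t_stat, (ci_low, ci_high) = welch_comparison(group_a, group_b)
print(f"difference={mean_diff:.2f}  t={t_stat:.2f}  "
      f"approx. 95% interval=({ci_low:.2f}, {ci_high:.2f})")
```

If the interval for the difference excludes zero, the corresponding
two-sided test at the same level rejects equality of the means.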

Thanks,
Fred



On Sep 2, 2009, at 10:12 PM, Monica Barratt wrote:

> Hi everyone
>
> I'm currently writing up my thesis which has the working title
> 'Researching the forums: Illicit drug use in a networked world'. I
> conducted an online survey using a purposive (nonprobability) sample
> of illicit drug users who used internet message boards (forums) to
> discuss or read about drugs. Originally I intended to conduct
> inferential statistics on this sample of 915, as this is the general
> practice in many other papers I had read. After some more thought
> though, I'm leaning away from that.
>
> Following is my thinking about this issue. I would really appreciate
> some feedback on this from anyone with an interest in this area
> (non-experts welcome too!)
>
> *My understanding of the sampling and statistical inference in my
> thesis work*
>
> There are two types of samples: probability and nonprobability.
> Probability samples occur when each individual from the population of
> interest has an equal (non-zero) chance of being included in the
> sample (random selection). In contrast, nonprobability samples
> contain self-selected individuals from a population of interest - not
> everyone has a chance of participating, so we can't calculate the
> relationship between the sample and the population of interest.
>
> Probability samples of illicit drug users are rare. This is because to
> conduct a probability sample, the researcher needs to have a defined
> population, such as a list of students or phone numbers of households.
> Illicit drug use is a rare behaviour on a population level (excluding
> perhaps, ever use of cannabis) and it is unlikely that a list of drug
> users will exist given the illegality of the behaviour and reluctance
> to self-identify on such a list.
>
> Inferential statistics are not compatible with nonprobability
> samples. A core assumption of the use of inferential statistics is
> that individuals are randomly selected from the population of
> interest. Without this randomness, the logic of inferential
> statistics does not hold.
>
> Inferential statistics can be further categorised into parametric and
> non-parametric statistical methods. These types of inferential
> statistics are chosen depending upon the distribution of the
> variables to be analysed; e.g. parametric statistics for continuous
> normal variables and nonparametric statistics for nonnormal or
> categorical/ordinal variables.
>
> Nonparametric or distribution free statistics are still inferential.
> So they too are incompatible with nonprobability samples.
>
> Descriptive statistics can still be applied to nonprobability samples
> to determine the relationships between variables in the dataset. What
> should not be done is 'significance testing' as the aim of this
> testing is to determine whether a relationship is strong enough or a
> difference is large enough, given the sample size, to be
> representative of a difference in the population. This assumes that
> the sample has a known relationship to the population. This is
> meaningless when applied to a nonprobability sample.
>
> There are still good reasons to conduct a nonprobability sample.
> There are simply situations when probability samples are impossible
> to obtain or just too expensive (arguably this applies to my
> population of interest). They are also useful in exploratory or
> preliminary studies (also relevant to me). The trick is not to apply
> inappropriate statistical tests to data collected in this way.
>
> Why is it then that we see probability statistics routinely conducted
> upon nonprobability samples, especially in the drug studies field? Is
> it something about making our research appear more scientific with
> the addition of a p < .05? Is it ignorance? Or do I have it wrong
> myself? Are there times when inferential statistics, e.g. a t-test or
> a correlation coefficient, can be applied to nonprobability samples?
> Are there any exceptions to this rule?
>
>
>
> -- 
> Monica Barratt
> BSc(Psych); PhD in progress...
> National Drug Research Institute
> Melbourne, Victoria, Australia
> http://preview.tinyurl.com/lwyyzq


--
Fred Stutzman
Ph.D. Student and Teaching Fellow
School of Information and Library Science, UNC-Chapel Hill
fred at fredstutzman.com | (919) 260-8508 | http://fredstutzman.com/



