[Air-L] substantive solutions -- Re: Criteria for proving that online data (especially forum comments) are real?

Michael Scarce scarce at mac.com
Sat Jul 28 04:33:32 PDT 2012


I was pleased to see this question posed to the group. I've been doing research utilizing a web-based national survey, offering a $25 Amazon.com electronic gift code upon completion. The only identifying information respondents are asked to actively provide us is their email address for receipt of the code. We do not use "panels" of participants. We do not screen them by telephone. We do not have prior email exchanges with assignment of unique ID codes.

I completed what I believe to be an overly exhaustive lit review, contacted others around the country doing similar work, and so on. I continue to hear of researchers who land big grants after throwing around Facebook/Twitter-style buzz language, collect their data, and then cannot publish their results and analyses because journal editors now ask, "How can you explain that you were able to establish the 'uniqueness' of each respondent as an individual rather than a repeat responder?" In other words, can you validate your data as a sample of 100 individuals rather than one person who responded 100 times with junk data?

We learned that the web survey service to which we subscribe can gather a range of what I call "paradata" about the survey site's visitors.

I revised the welcome page of the survey to include a privacy statement informing research subjects that if they continue with the survey by proceeding to the next page, their IP address, GeoIP information, web browser User-Agent string, site referrer, etc. will be collected.
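For anyone curious what that capture amounts to in practice, here is a minimal sketch. The function name and the WSGI-style request environment are my own illustration, not the survey platform's actual interface, and the GeoIP lookup is stubbed out:

```python
import time

def collect_paradata(environ):
    """Pull visitor paradata out of a WSGI-style request environment.

    Hypothetical example: real survey platforms expose these fields
    through their own APIs, not necessarily a raw environ dict.
    """
    return {
        "ip": environ.get("REMOTE_ADDR", ""),
        "user_agent": environ.get("HTTP_USER_AGENT", ""),
        "referrer": environ.get("HTTP_REFERER", ""),
        # A GeoIP lookup would normally query an external database
        # (e.g. MaxMind); stubbed here as a placeholder.
        "geoip": None,
        "timestamp": time.time(),
    }
```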

After the survey's initial screening questions, qualifying respondents are presented with an electronic informed consent document. The consent form explains in very basic language that anyone who accesses the survey by altering their internet connection in any way that might obscure or misrepresent their online presence will be deemed ineligible. We went several rounds back and forth with our IRB on this. In the end, they were quite pleased and have begun referring other researchers my way who struggle with similar issues.

Our primary problem was this: the web survey could block duplicate respondents based on IP address as well as cookies. However, we were taken aback by the huge number of proxy users --- some individuals took the survey more than 30 times in less than a day, providing 30 different email addresses!

After implementing the privacy page and revising the consent form, I created an algorithm for weeding out the vast majority of proxy users, as well as people who know they can simply unplug their cable modem for a few hours, plug it back in, and be re-assigned a new dynamic IP by their ISP. Aside from the high prevalence of "cheaters," I found a wide range of sophistication and motivations behind the techniques people used. For example, there is "marketing" (aka spam) software that can auto-generate a few dozen free email accounts at the press of a button, even configuring them all to forward to the same address, with each sounding reasonably realistic, like "johndoe631 at gmail.com." There are web sites selling subscriptions to "professional survey takers": the customer subscribes, and the web site gives them a list of surveys scraped from search engines that offer some kind of reward or incentive. I believe one of the sites was something like "swagbucks.com."
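To make two of those signals concrete (the actual algorithm isn't reproduced here, so this is only an illustrative sketch under my own assumptions): auto-generated Gmail aliases can be collapsed by stripping plus-tags and dot-insertion, and a modem power-cycle that draws a new dynamic IP from the same ISP pool can still be clustered by grouping addresses into /24 subnets.

```python
from ipaddress import ip_network

def normalize_email(addr):
    """Collapse common disposable-address tricks: plus-tagging and,
    for Gmail, dot-insertion both forward to the same inbox."""
    local, _, domain = addr.lower().partition("@")
    local = local.split("+", 1)[0]
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")
    return f"{local}@{domain}"

def subnet_key(ip, prefix=24):
    """Group responses by subnet so a re-assigned dynamic IP from the
    same ISP pool still clusters with the earlier response."""
    return str(ip_network(f"{ip}/{prefix}", strict=False))

def flag_duplicates(responses):
    """Flag responses sharing a normalized email or a
    (subnet, user agent) fingerprint with an earlier response."""
    seen_email, seen_fp, flagged = set(), set(), []
    for r in responses:
        email = normalize_email(r["email"])
        fp = (subnet_key(r["ip"]), r["user_agent"])
        if email in seen_email or fp in seen_fp:
            flagged.append(r)
        seen_email.add(email)
        seen_fp.add(fp)
    return flagged
```

A real screening pass would combine many more signals (GeoIP, referrer, timing, response content) and, as noted above, involves judgment calls rather than a single formula.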

I've written an article that serves as a methods paper describing this process of eliminating false data, validating research subjects, and designing a study that allows us to ethically disqualify those who complete surveys over and over. The problem had the potential to drain our study's budget and leave us with nothing we could use. I included a case study of how we worked with our IRB in responding to someone who actually complained and demanded their incentive after taking the survey 15 times. It's a tricky game, substantiating their ineligibility and deception without teaching them how they were detected and prompting them to find other security vulnerabilities. The article focuses on the more technical aspects of the process, the judgment calls, and the algorithm --- not just a simple statistical formula. In summary, I suggest Internet researchers must begin to think like hackers, which I happen to be.

In the process of polishing the paper to submit for publication, I'm struggling with which journals to target. The survey itself is related to HIV and cancer, so I could go that route, or something methodological, or techie, or more obscure like "The Journal of Medical Internet Research," and so on.

I'm working on a second paper that addresses the many ethical aspects and considerations for collecting data from Internet research subjects, sometimes without their knowledge or consent, for purposes of data validity. For example, IP addresses are among the list of HIPAA-defined identifiers of personal health information.

Any suggestions for publication submission, or how to better frame the piece within a particular discipline or field of study, would be greatly appreciated.


- Michael Scarce

_________________________________
Michael Scarce
Michael.Scarce at ucsf.edu
Research Specialist
UCSF Division of Infectious Diseases / 
Center for AIDS Prevention Studies
50 Beale Street, Suite 1300
San Francisco, CA  94105 

   phone 	(415) 597-4979
   fax 	(415) 597-9213


On Jul 26, 2012, at 2:01 PM, Marj Kibby wrote:

> Is there additional onus of proof for web based material? How do you prove that survey responses are real, that notes from interviews are real ....
> 
> The verification is in how you frame the project and write up the results.
> 
> Marj
> 
> 
> 
> 
> 
> Associate Professor Marjorie Kibby, B.Ed, M.A, Ph.D, FHERDSA
> Director, Student Experience FEDUA
> Head of Discipline: Film, Media and Cultural Studies
> School of Humanities and Social Science
> The University of Newcastle  Callaghan NSW 2308 Australia
> Marj.Kibby at newcastle.edu.au
> +61 2 49216604
>>>> Maria Eronen <m85327 at student.uwasa.fi> 27/07/12 6:24 AM >>>
> Hi,
> 
> Would anyone know what the criteria are concerning internet material's  
> validity? Is it common that you as a researcher will be asked to prove  
> that the material you have collected from the internet is real? Since  
> a lot of internet data disappears every day, mere URL addresses are not  
> enough. Even HTML files can be modified after saving webpages.
> 
> I would appreciate if someone had time to answer.
> 
> Maria
> 
> 
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> 
> Join the Association of Internet Researchers:
> http://www.aoir.org/
> 



