[Air-l] AOL Releases Search Logs from 500,000 Users
Jim Jansen
jjansen at ist.psu.edu
Wed Aug 9 07:26:35 PDT 2006
Hi!
As a researcher who has employed search engine transaction logs in research projects for nearly a decade, I think the concerns about the AOL data release are out of proportion to reality. Note, from the example in the NYT story, that even with 3 months of query data, including geographical data, the reporter wasn't sure that this was the person. (BTW, the reporter, Saul Hansell, obviously didn't mind publishing the lady's queries for the entire world to see -- with her name. I hope he adequately explained to the lady the ramifications of what she was agreeing to.)
It is VERY difficult to identify a particular searcher using just query terms, which is why researchers have been struggling with personalization for nearly two decades. In the DOJ vs. Google case, which is mentioned in the story, Google had to provide the queries to the DOJ (a statistically significant sample of about 5,000 instead of the larger number the DOJ was asking for). The privacy concerns were weighed against other factors, which is what we, as researchers, should be doing here.
There is no other way to get real world interaction data from a significant sample of Web users unless the search engine companies provide it to academic researchers. Many search engine companies provide and have provided this type of data (including Excite, AltaVista, AlltheWeb, Lycos, AOL, Yahoo!, MSN, and Google, among others -- they all do it). Many search engine companies post it on their Web pages, provide it to researchers or the government, or sell it to commercial research companies.
Are there potential privacy concerns with such data release? Yes. Are there potentially great benefits with such data release? Yes.
A good road ahead for the research community is to work on ways to preserve privacy in such data releases and provide a balanced voice in these debates.
Best,
Jim
**************************************
Jim Jansen
Email: jjansen at acm.org
URL: http://ist.psu.edu/faculty_pages/jjansen/
Blog: http://jimjansen.blogspot.com/
Phone: 814-865-6459 Fax: 814-865-6426
College of Information Sciences and Technology
The Pennsylvania State University
329F Information Sciences and Technology Building
University Park, PA, 16802, USA
**************************************
________________________________
From: air-l-bounces at listserv.aoir.org on behalf of Jennifer Stromer-Galley
Sent: Wed 8/9/2006 9:48 AM
To: air-l at listserv.aoir.org
Subject: Re: [Air-l] AOL Releases Search Logs from 500,000 Users
The New York Times has an article online today about the AOL release of data.
You can find the article at
http://www.nytimes.com/2006/08/09/technology/09aol.html. (NyTimes.com requires
registration).
The article highlights one woman, a 60-year-old from Georgia, whose searches
were captured in the three month period. Much is revealed about her in her
search queries . . . It also discusses the release of the data. AOL
spokespeople are saying they did not authorize the release - that an employee
acted hastily and without authorization to release it. The article also
reports that programmers have set up Web sites to let people search the data
in the database, which is leading people to find shocking or amusing search
histories.
Eeeek.
Best,
~Jenny
--
Assistant Professor
Department of Communication, SS 340
University at Albany, SUNY
1400 Washington Ave.
Albany, NY 12222
518-442-4873
jstromer at albany.edu
http://www.albany.edu/~jstromer
_______________________________________________
The air-l at listserv.aoir.org mailing list
is provided by the Association of Internet Researchers http://aoir.org
Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
Join the Association of Internet Researchers:
http://www.aoir.org/