[Air-L] Analyzing Google Groups?

Jerom Janssen jfjanssen at gmail.com
Sun Nov 22 13:19:19 PST 2009

Dear Claudia, (and other members of Air-L,)

If it is not absolutely required that the data you want to analyze come from
Google Groups, and similar data on people discussing on the Internet in
a non-face-to-face manner could also be of interest, then maybe the
following helps:

Although there seems to be a Google Groups API (see
something that might be useful; I don't know for sure whether this is
even "official" in the sense that it stems from Google or is endorsed by
them), I don't know of a way to neatly extract messages (or full threads /
discussions, for that matter) using this API. (An API is a software-interface to
communicate with other programs, e.g. Google Groups, which you can use in
your own programs.) A couple of years ago, we were planning on extracting
messages and especially entire threads / discussions from Google Groups, but
reading Google's Terms of Service (ToS) I found our plans were in
contradiction with them. Maybe this has changed by now, but I would be
surprised if that were the case.

At the time there were, in my opinion, three options:

[1] - Abandon our research idea;

This was of course not attractive, so I dismissed this option.

[2] - Try to write a web-crawler to extract the messages;

This was also a breach of Google's ToS. Another disadvantage was that we
wanted data other people could use/access for analysis as well, so that
others with an interest in this field would be able to do similar/related
analyses using the same or similar data. Both points were thought to be
sufficient reasons for rejecting this approach. We did try to obtain
Google's permission (which we did get), but the latter point remained:
Google would not allow us to distribute our data set, even if we built our
own HTML scraper to extract it from their web pages, so other researchers
would have no access to the data.

[3] - Instead of sourcing data from Google Groups, use an internet
provider's (ISP) access to Usenet messages;

Because we wanted to see whether the cultural differences in the ways people
communicate face-to-face also exist when people use a digital, text-only
medium (Usenet newsgroups), we also looked at plain Usenet
messages. These can be obtained in plain-text format using a program like
"inn" (which runs under Linux, I don't know about Windows/Mac environments).
This program can store all messages / threads of your choosing in a
directory structure, and it can update this tree at a specified interval,
e.g. every 24 hours. This results in a directory structure filled with
plain-text files which you can parse in situ, and/or load into some other
form of database for analysis. You can extract all kinds of interesting
information from these files; for more information / ideas see the official
Usenet (NNTP) communication protocol specification.
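
To give an idea of what parsing such files in situ could look like, here is a
minimal sketch in Python. It is an illustration only, not the setup we used.
It assumes the news server stores one article per plain-text file somewhere
below a spool directory; the function names (parse_article, load_threads) and
the record fields are made up for this example. It relies on the standard
Message-ID, References and Date headers of Usenet articles to group messages
into threads.

```python
import email
from collections import defaultdict
from email.utils import parsedate_to_datetime
from pathlib import Path


def parse_article(raw: bytes) -> dict:
    """Extract a few headers from one plain-text Usenet article."""
    msg = email.message_from_bytes(raw)
    msg_id = (msg.get("Message-ID") or "").strip() or None
    refs = (msg.get("References") or "").split()
    try:
        date = parsedate_to_datetime(msg.get("Date"))
    except (TypeError, ValueError):
        date = None  # missing or malformed Date header
    return {
        "id": msg_id,
        "from": msg.get("From"),
        "subject": msg.get("Subject"),
        "date": date,
        # By convention the first entry in References is the thread's
        # first article; an article without References starts a thread.
        "root": refs[0] if refs else msg_id,
    }


def load_threads(spool_dir: str) -> dict:
    """Group every article file found below spool_dir by thread root."""
    threads = defaultdict(list)
    for path in sorted(Path(spool_dir).rglob("*")):
        if path.is_file():
            article = parse_article(path.read_bytes())
            threads[article["root"]].append(article)
    return dict(threads)
```

Calling load_threads on the top of the directory tree returns a dictionary
that maps the Message-ID of each thread's first article to the list of
articles in that thread. From there the records can be loaded into a database
or analyzed directly. Real spools may contain non-article files and
incomplete threads, which this sketch does not handle.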

Although this can be a bit of work to set up, it can be very rewarding: a
vast and still-growing database of data created by a great number of
contributors. Another advantage is that their contributions are precisely
time-stamped. And, of course, other researchers can access this publicly
available data.

There are disadvantages, too. For instance, it may be a little complicated
to set up. (Probably best to leave this to a sysadmin.) Also, Usenet is
(ab)used for swapping copyrighted materials such as movies & music, and it
is also used for things like distributing (child!) pornography. By excluding
certain groups and especially the so-called "binaries" (binary data files,
i.e. not consisting of plain-text) you can avoid such unwanted discussion
topics. Bear in mind that you and/or your institution can get in trouble
when downloading such things, even if you do this in an automated fashion.
By being precise about this when configuring, you can have access to a
fascinating and largely untapped source of data.
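
Normally the exclusion would be done in the news server's own configuration,
so that unwanted groups are never downloaded in the first place. As an extra
check at analysis time, something like the following Python sketch could be
used. The patterns and the function names (is_wanted, keep_article) are
illustrative assumptions and would need to be adapted to the groups you
actually want to exclude.

```python
from fnmatch import fnmatchcase

# Illustrative patterns only: any group with a "binaries" component
# in the middle or at the end of its name.
EXCLUDED_PATTERNS = ("*.binaries", "*.binaries.*")


def is_wanted(group: str, excluded=EXCLUDED_PATTERNS) -> bool:
    """Return True if the newsgroup name matches none of the patterns."""
    name = group.strip().lower()
    return not any(fnmatchcase(name, pattern) for pattern in excluded)


def keep_article(newsgroups_header: str) -> bool:
    """Keep an article only if every group it was posted to is wanted."""
    groups = [g.strip() for g in newsgroups_header.split(",") if g.strip()]
    # An article without any group listed is dropped as well.
    return bool(groups) and all(is_wanted(g) for g in groups)
```

Here keep_article takes the value of an article's Newsgroups header, which
lists the groups it was posted to, separated by commas. A cross-posted
article is dropped as soon as one of its groups is excluded, which is the
cautious choice.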

I hope this helps. If not, maybe it would help us Air-L members if you could
provide some specifics about what it is you want to use / do?

Jerom Janssen

On Sat, Nov 21, 2009 at 16:46, Claudia Mueller-Birn <clmb at cs.cmu.edu> wrote:

> Dear all,
> I am interested in doing some research using data from Google Groups;
> ideally I'd like to have the group archive in mbox or other parseable
> format. I can't imagine I am the first person who wants to do this and I am
> wondering if anyone has any tips or ideas.
> Thank you.
> :::Claudia
> Claudia Mueller-Birn | Post-doctoral Fellow/Alexander von Humboldt
> Fellowship Researcher | Carnegie Mellon University | Institute for Software
> Research (ISR) | 5000 Forbes Avenue Pittsburgh, PA 15213 | phone: (412) 268
> 6367 | mail: clmb at cs.cmu.edu
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at:
> http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> Join the Association of Internet Researchers:
> http://www.aoir.org/
