Thu Jun 18 16:16:45 PDT 2009

hey all,

Part of my research requires identifying threads on a listserv for 
analysis. A thread consists of an initial email and the series of 
responses to it.

Of course, the easiest way to do this is to sort emails by subject 
line. However, as you might know, this is incomplete: some 
participants change the subject line for a variety of reasons while 
still remaining in the same thread. One could instead analyze the 
information in the email headers to identify threads, but in my case 
this data is not always available. Alternatively, one could manually 
scan through the text of the emails, which is very time consuming 
with a large email corpus.
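
To make the subject-line approach concrete, here is a minimal sketch 
in Python. It assumes each email is available as an (id, subject) 
pair; the function names and the reply-prefix pattern are illustrative 
assumptions, not part of any existing tool. It merges "Re:" and "Fwd:" 
variants of a subject, but it still cannot recover threads where the 
subject was rewritten entirely.

```python
import re
from collections import defaultdict

# Assumed pattern: leading reply/forward markers such as "Re:", "RE[2]:", "Fwd:", "FW:".
_PREFIX = re.compile(r"^\s*((re|fwd?)(\[\d+\])?\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, collapse whitespace, and lowercase."""
    stripped = _PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()


def group_by_subject(emails):
    """Group (email_id, subject) pairs by their normalized subject line."""
    threads = defaultdict(list)
    for email_id, subject in emails:
        threads[normalize_subject(subject)].append(email_id)
    return dict(threads)
```

For example, group_by_subject([(1, "Thread detection"), (2, "Re: thread 
detection")]) places both emails under the key "thread detection".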

Therefore, what I need is a method (preferably automated) that can 
identify email threads by looking at the text of the emails. I can 
imagine software that clusters emails based on semantic similarity, 
with clusters that I could then equate to threads, but I haven't been 
able to find any yet.
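
As a rough illustration of the kind of text-based clustering described 
above, the following Python sketch builds TF-IDF vectors for each 
email body and links any two emails whose cosine similarity reaches a 
threshold (single-link clustering). Everything here is an assumption 
made for illustration: the tokenizer, the smoothed IDF formula, the 
function names, and the threshold of 0.3, which would need tuning on 
real data. Note also that topical similarity is only a proxy for 
thread membership; two separate threads on the same topic may be 
merged.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict of word -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency: count each word once per text
    return [
        {w: tf * (math.log((1 + n) / (1 + df[w])) + 1.0) for w, tf in doc.items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_emails(texts, threshold=0.3):
    """Single-link clustering: join emails whose similarity >= threshold.

    Returns a sorted list of clusters, each a list of indices into texts.
    """
    vectors = tfidf_vectors(texts)
    parent = list(range(len(texts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(texts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Calling cluster_emails(bodies) on a list of email body strings returns 
groups of list indices that could then be inspected as candidate 
threads. The pairwise comparison is quadratic in the number of emails, 
so a large corpus would need a more scalable approach.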

The units of analysis I have described are fairly common and, I 
imagine, so is my problem. Perhaps people on this list can point me 
to existing methods, software, or papers that have already addressed 
this issue?


Dhanaraj Thakur
Ph.D. Candidate
School of Public Policy
Georgia Institute of Technology

