[Air-L] Twitter Data Sharing Update - Thou Shalt Not Share Collections of Tweets

Devin Gaffney itsme at devingaffney.com
Thu May 5 13:22:05 PDT 2011


Hey all,

So, yes, regardless of whether or not it's "right" of them to block the distribution of raw catalogs of their data, they now completely disallow this activity. Obviously, this makes our job tougher, especially when vetting or reviewing a paper that comes out with Twitter data (particularly since this stuff is so new and really needs that review in order to make compelling arguments). I'm just a fledgling academic, but my understanding is that you generally need the raw data backing assertions in order to really test them. I don't know if there's a similar situation out there where assertions are made but the data backing them cannot be made public. Either way, I think there are two separate motivations for doing this:

1. The need to make sure data doesn't fall into the wrong hands (particularly spambots, applications that have been blacklisted, and other people or programs that cause harm to their environment and eventual valuation).
2. The need to control really the only real piece of value for Twitter, the potential demographic data. Facebook played it correctly by never opening its data up and banking that their in-house work would be able to add value to their system, without relying on open data and outside programmers leveraging it to increase the value of the company.

Now that Twitter has scaled up to this size, the benefits of open data are starting to be outweighed by the costs, and I think that's the big deal.

That said, I have talked with the people at Twitter, and they agreed that all analytical results from Twitter data can be let loose. We can imagine a situation where we have some raw set of data, perform a vast battery of analytics on it, and then open up the resulting CSVs to the public. Essentially, you could have the same effect as making all the raw data public, just broken up into these sets of analytical results (that is, someone could reverse engineer the analytics to get the catalog). This seems to be totally within the TOS, and I've been working feverishly to get something out that allows us to collectively push data out this way. Basically, if we have a platform that easily allows us to hammer a dataset through 50 analytical processes, and the collection process for that dataset is very transparent and the collector algorithm is respected and understood, then we can sort of mitigate this problem (not the best solution, but a solution nonetheless).
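
To make the idea concrete, here is a minimal sketch of what "hammering a dataset through a battery of analytics" could look like. Everything in it is an assumption for illustration: the tweet record shape (a dict with "user", "text", and "created_at" keys), the three analytics, and the names ANALYTICS and run_battery are all hypothetical and do not describe any existing platform. The point is only that what comes out is a set of CSVs of analytical results, never the raw tweets.

```python
import csv
import io
from collections import Counter

# Hypothetical record shape (an assumption): each tweet is a dict with
# "user", "text", and "created_at" (ISO 8601 string) keys.

def tweets_per_day(tweets):
    # Count tweets by calendar day (first 10 chars of the ISO timestamp).
    counts = Counter(t["created_at"][:10] for t in tweets)
    return [("day", "tweet_count")] + sorted(counts.items())

def tweets_per_user(tweets):
    # Count tweets by author.
    counts = Counter(t["user"] for t in tweets)
    return [("user", "tweet_count")] + sorted(counts.items())

def hashtag_counts(tweets):
    # Count case-folded hashtags found by naive whitespace splitting.
    counts = Counter(
        word.lower()
        for t in tweets
        for word in t["text"].split()
        if word.startswith("#") and len(word) > 1
    )
    return [("hashtag", "count")] + sorted(counts.items())

# The "battery": a registry mapping a result name to an analytic.
# Adding a 4th (or 50th) process is just another entry here.
ANALYTICS = {
    "tweets_per_day": tweets_per_day,
    "tweets_per_user": tweets_per_user,
    "hashtag_counts": hashtag_counts,
}

def run_battery(tweets, analytics=ANALYTICS):
    """Run every analytic and return {name: CSV text}.

    Only the analytical results are serialized; the raw tweets
    themselves are never written out.
    """
    results = {}
    for name, analytic in analytics.items():
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(analytic(tweets))
        results[name] = buf.getvalue()
    return results
```

Each value in the returned dict could then be saved as its own CSV file and shared, alongside a description of how the underlying dataset was collected.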

And you're right, Michael, about the LoC 6-month lag time, as far as I have heard as well. Also, the LoC collecting that data and the LoC allowing access to it are entirely separate beasts. I'm sure they'll allow open access, but the details of that access, none of which have been announced, could turn out to be insurmountable obstacles for large-scale research.

Devin

On May 5, 2011, at 8:59 AM, Stephen J Cavrak Jr wrote:

> Quoting Michael Zimmer <zimmerm at uwm.edu>:
> 
>> Stu-
>> 
>> ... that tweets are meant to be fleeting.
>> 
>> 
> 
> music
> once you play it
> is in the air
> gone forever ...
> 
> 
> 
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> 
> Join the Association of Internet Researchers:
> http://www.aoir.org/



