[Air-L] Software to extract content of Facebook & Twitter

Tim Libert tlibert at asc.upenn.edu
Wed Aug 27 15:41:59 PDT 2014


I once figured out a way to get YouTube to spit out all of the comments at once by tweaking an AJAX request, but on quick inspection I can't figure out how to do it again. It was possible, though; you may have to bribe your coder friend with some Club-Mate. ;-)

Shawn is totally on the mark re: rate-limiting; Google in particular is strict about scraping/bot-like behavior. Anything you set against their properties needs some built-in politeness; a random interval of 5-15 seconds was working for me at one time.
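e.g., a minimal sketch of that kind of politeness in Python (the `fetch` callable and URLs are placeholders, not a real scraper):

```python
import random
import time

def polite_fetch(urls, fetch, min_delay=5, max_delay=15):
    """Fetch each URL via `fetch`, sleeping a random 5-15 s between requests
    so we don't hammer the server with bot-like bursts."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:  # no need to sleep after the last request
            time.sleep(random.uniform(min_delay, max_delay))
    return results
```

swap in your own fetch function (urllib, requests, whatever) and tune the delays to taste.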

Another option is to just pay somebody on Amazon Mechanical Turk to copy/paste for you; that could be the cheapest route time- and resource-wise, and it requires no new tools.
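also, to illustrate the static-HTML trap I mention in my earlier message below: a toy page (everything here is made up) where the "comments" exist only as a JavaScript instruction, so a static parser never sees them:

```python
from html.parser import HTMLParser

# Toy page: the visible "comments" exist only inside a <script> tag,
# the way many dynamic sites deliver content via JavaScript/AJAX.
PAGE = """
<html><body>
  <div id="comments"></div>
  <script>
    document.getElementById('comments').innerHTML = '<p>great video!</p>';
  </script>
</body></html>
"""

class TextExtractor(HTMLParser):
    """Collect visible text the way a static scraper would: skip <script> bodies."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())

parser = TextExtractor()
parser.feed(PAGE)
# The comment text never appears in the statically extracted text:
print("great video!" in " ".join(parser.chunks))  # prints False
```

a JavaScript-executing tool (phantomjs, selenium) would run the script and see the comment; a static fetch only sees the instructions for rendering it.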

- t

On Aug 27, 2014, at 5:41 PM, Shawn Walker <stw3 at uw.edu> wrote:

> Hi Anu,
> 
> To extract YouTube comments, consider TubeKit (http://tubekit.org/). I've used it in a few projects to extract YouTube video metadata, videos, and comment data with great success.
> 
> Another consideration with respect to these tools for FB or Twitter data collection is which APIs each tool uses. Depending on the API, you might receive only a small sample of data, or you might be rate-limited. So it's important to understand how any tool you use works and what implications or limitations that has for your research. Historical data is notoriously difficult to get; purchasing it is an option, but that adds a new set of limitations too (deleted accounts and posts, URL decay, etc.).
> 
> These issues need to be discussed more openly and critically addressed. :)
> 
> --
> Shawn Walker
> PhD Candidate
> Information School
> University of Washington
> stw3 at uw.edu — students.washington.edu/stw3
> SoMe Lab @ UW - somelab.net
> 
> ________________________________________
> From: Air-L <air-l-bounces at listserv.aoir.org> on behalf of Harju Anu <anu.harju at aalto.fi>
> Sent: Wednesday, August 27, 2014 11:26 AM
> To: Tim Libert
> Cc: air-l at listserv.aoir.org
> Subject: Re: [Air-L] Software to extract content of Facebook & Twitter
> 
> Hi everyone,
> 
> I'm also grateful for all these suggestions for various tools. For a paper in my PhD I'm looking at YouTube comment threads, and I was wondering if any of you might know of a tool that can extract those. It's a very laborious process to do manually and it drives me insane. I once asked a coder friend of mine, but he said it was more complicated than he had initially thought, and we left it at that.
> 
> Thank you in advance, and thanks for a great list! I've been a lurker for quite some time now and find it very useful.
> 
> Best,
> Anu
> 
> 
> Anu Harju
> Doctoral Candidate
> Aalto University
> Helsinki
> Finland
> 
> Sent from my iPhone
> 
> On 27.8.2014, at 18.06, "Tim Libert" <tlibert at asc.upenn.edu> wrote:
> 
>> I'd quickly point out two additional considerations when ingesting FB/Twitter data:
>>
>> 1) APIs generally exclude ads (which are 'targeted'), so depending on what you want to study and/or model, an API will never give you an accurate view of what users really see. APIs are easy, but incomplete.
>>
>> 2) The trick with scraping content directly from the web is accounting for processing/executing JavaScript, since that is how many pages pull in content dynamically (there may also be other factors: redirects, iframes, canvas, etc.). If your tool (e.g., Python's urllib) can only access static HTML, you will not be able to pull the content you want: you will be retrieving instructions for how to render the content dynamically rather than the content itself. I am not sure how your tool in R works, but I imagine this is a likely issue you are facing. I have developed some software that solves problem #2 by leveraging http://phantomjs.org/, but it's not quite ready for public release; in the meantime, you may want to look at an automation framework like Selenium (http://www.seleniumhq.org/).
>> 
>> - tim, phd student, upenn
>> _______________________________________________
>> The Air-L at listserv.aoir.org mailing list
>> is provided by the Association of Internet Researchers http://aoir.org
>> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>> 
>> Join the Association of Internet Researchers:
>> http://www.aoir.org/



