[Air-L] Software to extract content of Facebook & Twitter

Tim Libert tlibert at asc.upenn.edu
Wed Aug 27 08:06:27 PDT 2014


I’d quickly point out two additional considerations when ingesting Facebook/Twitter data:

1) APIs generally exclude ads (which are ‘targeted’) - so depending on what you want to study and/or model, an API will never give you an accurate view of what users really see.  APIs are easy, but incomplete.

2) The trick with scraping content directly from the web is accounting for JavaScript execution, as that is how many pages pull in content dynamically (there may also be other factors: redirects, iframes, canvas, etc.).  If your tool (e.g. Python's urllib) can only access static HTML, you will not be able to pull the content you want: you will be retrieving the instructions for rendering the content dynamically rather than the content itself.  I am not sure how your tool in R works, but I imagine this is a likely issue you are facing.

I have developed some software that solves problem #2 by leveraging http://phantomjs.org/, but it’s not ready for public release quite yet; in the meantime, you may want to consider using an automation framework like Selenium (http://www.seleniumhq.org/).
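
To make point 2 concrete, here is a minimal Python sketch (not the unreleased software mentioned above). The first part uses only the standard library to show what a static fetch hands you: the sample page, the loadPosts() call, and its JSON path are all hypothetical, made up for illustration. The second part is one possible way to get the rendered page through Selenium's Python bindings; it assumes the selenium package, Firefox, and its driver are installed, and is untested against Facebook or Twitter.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text a reader would see, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>
        self.script_count = 0
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.text_chunks.append(data.strip())


def summarize_static_html(html):
    """Return the visible text and the number of <script> tags in raw HTML."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return {
        "visible_text": " ".join(parser.text_chunks),
        "script_count": parser.script_count,
    }


# Hypothetical page: the posts only appear after the script runs in a browser.
SAMPLE_STATIC_HTML = """<html><body>
<div id="feed"></div>
<script>loadPosts("/hypothetical/posts.json", "feed");</script>
</body></html>"""


def fetch_rendered_html(url):
    """Assumption: selenium, Firefox, and its driver are installed locally."""
    from selenium import webdriver  # imported lazily so the rest runs without it

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # page_source is the DOM as the browser currently sees it; content that
        # loads late may need an explicit wait before this line.
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    print(summarize_static_html(SAMPLE_STATIC_HTML))
```

Running summarize_static_html() on the sample finds one script tag and no visible text, which is the symptom described above: the static HTML carries the loading instructions but not the posts. Passing the output of fetch_rendered_html() through the same function is a quick way to check whether a real browser actually fills the page in.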

- tim, phd student, upenn

