[Air-L] Software to extract content of Facebook & Twitter

craig boman craig.boman at gmail.com
Wed Aug 27 13:46:06 PDT 2014


Depending on your coding knowledge, you may be able to configure a screen
scraper like Scrapy (http://doc.scrapy.org/en/latest/) to get what you
need. I don't have much experience with it yet, but it is open source.
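
As a rough illustration of the scraping idea, and of the JavaScript caveat
Tim raises in the quoted message below, here is a minimal sketch that uses
only Python's standard library rather than Scrapy. The sample markup, the
"comment" class name, and the helper names are all made up for the example;
real pages will be structured differently.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so nesting depth stays correct.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CommentExtractor(HTMLParser):
    """Collect the text of elements whose class list contains target_class."""

    def __init__(self, target_class="comment"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # > 0 while inside a matching element
        self.comments = []    # finished comment texts
        self._buf = []        # text pieces of the comment being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse runs of whitespace into single spaces.
            self.comments.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def extract_comments(html, target_class="comment"):
    """Return the text of every matching element found in static HTML."""
    parser = CommentExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.comments


# Hypothetical page whose comments are present in the HTML itself.
STATIC_PAGE = (
    '<div class="comment">First!</div>'
    '<div class="comment">Nice <b>video</b></div>'
)

# Hypothetical page that loads its comments with JavaScript: the static
# HTML holds only an empty container and a script, not the comments.
DYNAMIC_PAGE = '<div id="comments"></div><script>loadComments("abc")</script>'

if __name__ == "__main__":
    print(extract_comments(STATIC_PAGE))
    print(extract_comments(DYNAMIC_PAGE))
```

The second call finds nothing, because the comments only exist after a
browser runs the script. That is the case where a browser automation tool
such as Selenium would be needed instead of a plain HTML parser.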

All the best,
Craig Boman
Ph.D. Student


On Wed, Aug 27, 2014 at 2:26 PM, Harju Anu <anu.harju at aalto.fi> wrote:

> Hi everyone,
>
> I'm also grateful for all these suggestions for various tools. For a
> paper for my PhD I'm looking at YouTube comment threads and I was wondering
> if any one of you might know a tool that can extract those? It's a very
> laborious process to do manually and it drives me insane. I once asked a
> coder friend of mine, but he said it was more complicated than he initially
> thought, and we left it at that.
>
> Thank you in advance, and thanks for a great list! I've been a lurker for
> quite some time now and find it very useful.
>
> Best,
> Anu
>
>
> Anu Harju
> Doctoral Candidate
> Aalto University
> Helsinki
> Finland
>
> Sent from my iPhone
>
> On 27.8.2014, at 18.06, "Tim Libert" <tlibert at asc.upenn.edu> wrote:
>
> > I’d quickly point out two additional considerations when ingesting
> > fb/twitter data:
> >
> > 1) APIs generally exclude ads (which are ‘targeted’), so depending on
> > what you want to study and/or model, an API will never give you an
> > accurate view of what users really see. APIs are easy, but incomplete.
> >
> > 2) The trick with scraping content directly from the web is accounting
> > for processing/executing javascript, as that is how many pages pull
> > content dynamically (there may also be other factors: redirects,
> > iframes, canvas, etc.). If your tool (e.g. Python urllib, etc.) can only
> > access static HTML, you will not be able to pull the content you want,
> > because you will be accessing the instructions for rendering the content
> > dynamically rather than the actual content. I am not sure how your tool
> > in R works, but I imagine this is a likely issue you may be facing.
> >
> > I have developed some software that solves problem #2 by leveraging
> > http://phantomjs.org/, but it’s not ready for public release quite yet;
> > however, you may want to consider using an automation framework like
> > selenium (http://www.seleniumhq.org/).
> >
> > - tim, phd student, upenn
> > _______________________________________________
> > The Air-L at listserv.aoir.org mailing list
> > is provided by the Association of Internet Researchers http://aoir.org
> > Subscribe, change options or unsubscribe at:
> > http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> >
> > Join the Association of Internet Researchers:
> > http://www.aoir.org/


