[Air-L] Software to extract content of Facebook & Twitter

Harju Anu anu.harju at aalto.fi
Thu Aug 28 00:14:45 PDT 2014


Hi everyone,

thank you all so much for all the suggestions! I will try them out as soon as I have the time. I suppose they work on Mac, too.

Noha, thanks for the offer of help, I might take you up on that if I run into any problems. I'm flying out to a conference today so won't be able to do anything in this regard for a week, but thanks again, much appreciated  :)

Best,
Anu

Sent from my iPhone

On 28.8.2014, at 8.57, "Noha Nagi" <noha.a.nagi at gmail.com> wrote:

Hi Anu,

I suggest you try NodeXL<http://nodexl.codeplex.com/>. It's simple and free. You will first need to install the Social Network Importer<http://socialnetimporter.codeplex.com/> so that NodeXL can grab Facebook, Twitter, Flickr, and YouTube data.

Good Luck !


On Wed, Aug 27, 2014 at 9:26 PM, Harju Anu <anu.harju at aalto.fi> wrote:
Hi everyone,

and I'm also grateful for all these suggestions for various tools. For a paper for my PhD I'm looking at YouTube comment threads, and I was wondering if any of you might know of a tool that can extract those? It's a very laborious process to do manually and it drives me insane. I once asked a coder friend of mine, but he said it was more complicated than he initially thought, and we left it at that.
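[Editorial note: for readers with the same question, YouTube's Data API exposes top-level comments via the v3 commentThreads endpoint. The sketch below is a minimal, hedged example, not a vetted tool: it assumes you have registered for an API key, and the `video_id` / field names follow the v3 API's documented response shape.]

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"

def fetch_comment_page(video_id, api_key, page_token=None):
    """Fetch one page of top-level comments for a video (network call)."""
    params = {"part": "snippet", "videoId": video_id,
              "key": api_key, "maxResults": 100}
    if page_token:
        params["pageToken"] = page_token
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def parse_comments(page):
    """Pull (author, text) pairs out of one commentThreads response page."""
    out = []
    for item in page.get("items", []):
        snip = item["snippet"]["topLevelComment"]["snippet"]
        out.append((snip["authorDisplayName"], snip["textDisplay"]))
    return out
```

Each response also carries a `nextPageToken`, which you would feed back into `fetch_comment_page` in a loop to walk an entire comment thread.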

Thank you in advance, and thanks for a great list! I've been a lurker for quite some time now and find it very useful.

Best,
Anu


Anu Harju
Doctoral Candidate
Aalto University
Helsinki
Finland

Sent from my iPhone

On 27.8.2014, at 18.06, "Tim Libert" <tlibert at asc.upenn.edu> wrote:

> I’d quickly point out two additional considerations when ingesting fb/twitter data:
>
> 1) APIs generally exclude ads (which are ‘targeted’), so depending on what you want to study and/or model, an API will never give you an accurate view of what users really see. APIs are easy, but incomplete.
>
> 2) The trick with scraping content directly from the web is accounting for processing/executing JavaScript, as that is how many pages pull content dynamically (there may also be other factors: redirects, iframes, canvas, etc.). If your tool (e.g. Python urllib) can only access static HTML, you will not be able to pull the content you want, as you will be accessing instructions for how to dynamically render content rather than the actual content. I am not sure how your tool in R works, but I imagine this is a likely issue you may be facing. I have developed some software that solves problem #2 by leveraging http://phantomjs.org/, but it’s not ready for public release quite yet; however, you may want to consider using an automation framework like Selenium (http://www.seleniumhq.org/).
>
> - tim, phd student, upenn
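[Editorial note: Tim's second point above can be shown with a toy example. The HTML below is invented for illustration: a static fetcher (urllib, R's readLines, etc.) sees an empty comment container plus the script that would fill it, while a JS-executing tool such as PhantomJS or Selenium would see the rendered text.]

```python
import re

# A page as a static fetcher would see it: the comment container is
# empty, and the text lives inside a script that only runs in a browser.
STATIC_HTML = """
<div id="comments"></div>
<script>
  document.getElementById('comments').innerHTML = '<p>Great video!</p>';
</script>
"""

def visible_comments(html):
    """Return whatever text sits inside the comments div of *static* HTML."""
    m = re.search(r'<div id="comments">(.*?)</div>', html, re.S)
    return m.group(1).strip() if m else None

# visible_comments(STATIC_HTML) yields an empty string: the static view
# contains only rendering instructions, not the actual comment content.
```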
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>
> Join the Association of Internet Researchers:
> http://www.aoir.org/



--
Noha A.Nagi


