[Air-L] Comment scraping

Pete[r] Landwehr plandweh at cs.cmu.edu
Tue Jan 22 07:05:44 PST 2013


I will put in a plug for the painful-but-standard-and-entirely-free solution:

* scrape the comments using a free, command-line based program like
wget (http://www.gnu.org/software/wget/) or curl
(http://curl.haxx.se/)

* clean the text using the BeautifulSoup Python package for parsing
HTML (http://www.crummy.com/software/BeautifulSoup/); a quick sketch
follows below.
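For instance, a minimal sketch of that pipeline, assuming BeautifulSoup 4;
the URL and file name here are placeholders, and you would adjust the tag
handling to the markup of the actual site:

    # Fetch the page first, e.g. with wget (placeholder URL):
    #   wget -O page.html "http://example.com/article-with-comments"
    from bs4 import BeautifulSoup

    with open("page.html") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Drop invisible script/style blocks, then keep just the readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    print(soup.get_text(separator="\n"))

If you need to preserve the reply structure (as with Disqus threads), you
would instead pull out the individual comment elements and whatever
parent/child attributes the site's markup exposes, rather than flattening
everything to text.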

...and I will put in a second, shameless plug for the text-cleaning
software put out by the CASOS Center at CMU, AutoMap
(http://www.casos.cs.cmu.edu/projects/automap/). While its main purpose
is to convert text data into network data, it incorporates a basic HTML
out-link scraper (probably not what you want) and a
remove-all-HTML-from-text-and-convert-symbols-to-English cleaner (which
might be what you are looking for). This may not be exactly what you
need, but if you are less familiar with coding solutions it should help
point you in the right direction.

Best,

pml


On Tue, Jan 22, 2013 at 8:54 AM, Casey Tesfaye <klt35 at georgetown.edu> wrote:
> Jacob, Jasmine, et al.,
>
> That software looks great! But expensive! I wonder if there is a cheaper
> alternative? (I'm working with the same type of data)
>
> Otherwise, the multipronged approach has been the best I've encountered:
> text file + screenshot + HTML file
>
> Thanks,
> Casey
>
>
> On Mon, Jan 21, 2013 at 10:57 PM, Jacob Groshek <jgroshek at gmail.com> wrote:
>
>> I highly recommend Discovertext.  http://discovertext.com/
>>
>> Easy to use, good tech support if/when you need it. Built-in coding
>> system. Can also export to a spreadsheet (if necessary) with a
>> subscription.
>>
>> Best,
>>
>> Jacob
>>
>> --
>> Dr. Jacob Groshek
>> Assistant (Visiting) Professor
>> Digital Media and Research Methods
>> jgroshek.com <http://www.jgroshek.com/>
>>
>> Head, CTEC <http://aejmcctec.com/> / AEJMC <http://www.aejmc.org/>
>> Visiting Scholar, IAST <http://www.iast.fr/>
>> Full Member, NeSCoR <http://nescor.socsci.uva.nl/>
>>
>>
>>
>> On Tue, Jan 22, 2013 at 2:47 PM, Jasmine E McNealy <jemcneal at syr.edu>
>> wrote:
>>
>> > Hello All,
>> >
>> > I'm looking for ideas on the best software to use for comment
>> > scraping. I plan on doing quantitative content and qualitative
>> > textual analysis on the comments connected to an article on an
>> > online publication. The publication uses Disqus for comments, and
>> > ideally I'd like a program that would maintain the integrity of the
>> > comment relationships. Any and all ideas are appreciated.
>> >
>> > Thanks,
>> >
>> > JM
>> >
>> > Jasmine McNealy
>> > Assistant Professor
>> > S.I. Newhouse School of Public Communication
>> > Syracuse University
>> > 215 University Place
>> > Syracuse, NY 13210
>> > 315-443-1151
>> > http://ssrn.com/author=1357319
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>
> Join the Association of Internet Researchers:
> http://www.aoir.org/


