[Air-L] Screen Scraping of URLS and WHOIS subject/category mining

Fri Feb 9 18:23:55 PST 2018

Hi Nathan,

Regarding your second question, I don't have any quick ideas. For your first question, though, BeautifulSoup, a Python module, should be helpful for scraping the URLs for their category. At least for pages that share the same structure (say, news articles on cnn.com, as in your example), you would be able to extract the information you mention.
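
To give a rough idea of what that could look like, here is a minimal sketch. It assumes the pages expose their category in a meta tag such as "article:section", which many news sites do but by no means all. The field list, the function name extract_category and the sample page are made up for illustration, so you would need to check by hand which tags your sites actually use.

```python
from bs4 import BeautifulSoup

# Hypothetical meta fields that news sites often use for a section or category.
# Which ones exist depends on the site, so inspect a few pages by hand first.
CATEGORY_FIELDS = ("article:section", "section", "keywords")


def extract_category(html):
    """Return the first category-like meta value found in the page, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for field in CATEGORY_FIELDS:
        # Open Graph style tags use property="...", older ones use name="...".
        tag = soup.find("meta", attrs={"property": field}) or soup.find(
            "meta", attrs={"name": field}
        )
        if tag and tag.get("content"):
            return tag["content"].strip()
    return None


# Made-up example page, standing in for a downloaded article.
sample_html = """
<html><head>
<title>Example headline</title>
<meta property="article:section" content="politics">
<meta name="keywords" content="election, senate">
</head><body><p>Story text</p></body></html>
"""

print(extract_category(sample_html))
```

For a real list of URLs you would download each page first (for example with the requests module) and pass its HTML to the function. Pages without any of these tags would need their own extraction rules.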

I don't know your background or how comfortable you are with Python. However, Patrik Wikström regularly holds a workshop on this here at the QUT DMRC and at our summer schools, and he has put the materials online in our GitHub repository: https://github.com/qut-dmrc/web-scraping-intro-workshop

It's aimed at people with no prior programming experience, although some experience would help if you work through the materials on your own.

Hope that helps.

Cheers,

Felix

Felix Victor Münch
PhD Candidate @ QUT Digital Media Research Centre
Social Media: https://about.me/flxvctr
Google Scholar: https://scholar.google.com.au/citations?user=yn1Rz_EAAAAJ
Academia.edu: https://qut.academia.edu/FlxVctr
ResearchGate: https://www.researchgate.net/profile/Felix_Muench
ORCID: https://orcid.org/0000-0001-8808-6790
QUT: http://staff.qut.edu.au/staff/muench/
QUT preprints: https://eprints.qut.edu.au/view/person/M=FCnch,_Felix_Victor.html

> On Friday, Feb 09, 2018 at 5:45 pm, Nathan Stolero <stolero at gmail.com> wrote:
> Dear AoIR members,
>
> I'm studying the information seeking behavior of adolescents, young adults
> and adults. One of the subjects I'm investigating is the difference between
> the URLS/Links users choose to use (navigate/browse to, click on, etc.) and
> the URLS/Links users tend to avoid (looking at them, deciding not to
> navigate/browse/click, using eye-tracking).
>
> As a result, I have a list of all the URLS the user visited during the
> experiment and a set of screenshots in which the avoided links are marked
> (I don't have the URLS because the user did not click on them, so the
> software did not save it). I have a question regarding these two lists:
>
> 1) Regarding the list of URLS -
> What would be the best way to mine a large list of URLS for their category?
> Let's say - http://www.cnn.com with news/broadcasting/content. I tried
> WHOIS lookups on the domains, hoping to find this information and then write
> code that would mine it for each link, but could not find anything significant.
>
> 2) Regarding the screenshots -
> Is there a way, maybe using screen scraping, to automatically translate
> textual links (clickable headlines, for example) to their URLS? Maybe using
> a simple protocol of: a) Scrape the text in a marked area, b) search this
> text on google, c) Use the first URL?
>
> I hope I've made my intentions clear, and I look forward to the wisdom of
> the virtual crowd.
>
> Nathan
>
> *************************************************************
> Nathan Stolero
> Doctoral Student
> The Communication Department, The Faculty of Social Science
> Tel Aviv University
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>
> Join the Association of Internet Researchers:
> http://www.aoir.org/