[Air-L] Fwd: Re: Tool to convert Website to PDF

Rodrigo Davies rodrigo.davies at gmail.com
Sun Sep 7 16:33:19 PDT 2014


Hi Leonie,

If your goal is to input the data into qualitative analysis software, is
there a particular reason you need a PDF at all? If not, your best bet
would be to use a scraper tool like ScraperWiki and extract plain text:
https://scraperwiki.com/tools/code-in-browser

If you want to scrape regularly (e.g. to track updates daily), ScraperWiki
takes the hassle out of running it on your own server. If you do want to host
it yourself and are at all familiar with Python, BeautifulSoup is a great
option: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

I used BeautifulSoup to write scrapers that collected data from ~6,000 pages,
and it was fast, reliable, and required very little code.
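In case it helps, here is a minimal sketch of that route, assuming Python 3 with the third-party bs4 package installed (pip install beautifulsoup4); the function names are just illustrative:

```python
# Minimal BeautifulSoup sketch: fetch a page and keep only its plain text.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def page_to_text(html):
    """Strip tags, scripts, and styles from an HTML document, returning plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements
    # collapse whitespace so the output imports cleanly into analysis software
    return " ".join(soup.get_text(separator=" ").split())

def fetch_text(url):
    """Download one page and return its visible text."""
    with urlopen(url) as resp:
        return page_to_text(resp.read())
```

Saving each page's text to a .txt file this way sidesteps the PDF step entirely for import into analysis software.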

R

--
Rodrigo Davies
Doctoral Researcher / PhD student
Center for Work, Technology and Organizations
Stanford University
rodrigod at stanford.edu



On Sun, Sep 7, 2014 at 11:57 AM, Elijah Wright <elijah.wright at gmail.com>
wrote:

> ---------- Forwarded message ----------
> From: "Elijah Wright" <elijah.wright at gmail.com>
> Date: Sep 7, 2014 1:57 PM
> Subject: Re: [Air-L] Tool to convert Website to PDF
> To: "Leonie Tanczer" <ltanczer01 at qub.ac.uk>
> Cc:
>
>
> Something like this:
>
> https://sites.google.com/site/torisugari/commandlineprint2
>
> Having a browser engine print to file from the CLI is only one way to attack
> this.  My first inclination was to suggest piping the output from the lynx
> text browser through Ghostscript.  If you don't care about any of the
> visual affordances of the web, that would be a heck of a lot
> quicker. (E.g., do you really just want the words?)
>
> You will likely want to use wget or similar to spider the site, then
> extract the list of URLs, then in a second pass use a heavier renderer
> (big, fat Firefox...) to convert to PDF.
>
> It's going to take a very long time if you really have 10k+ pages.
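[Ed.: the two-pass idea above — collect the site's URLs first, then convert each page — can be sketched with just the Python standard library; the class and site names here are purely illustrative:]

```python
# Pass 1 of the two-pass approach: collect same-site URLs from each page's
# links. Pass 2 would then feed that list, one URL at a time, to a heavier
# renderer (a headless browser printing to PDF, or a plain-text extractor).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect absolute same-host URLs from the anchor tags of one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if urlparse(absolute).netloc == self.host:  # stay on the target site
            self.links.add(absolute)

def links_in_page(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted(collector.links)
```

Looping this over fetched pages (a to-visit queue plus a visited set) yields the URL list for the second pass.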
>
> You might consider narrowing the scope of your research to a smaller
> subset - or otherwise expect to become quite practiced at solving this
> style of problem.
>
> Asking the MAXQDA folks to enhance their software would be another way to
> proceed - it's not just you who runs into this perennial problem.  :-)
>
> --e
> On Sep 7, 2014 10:56 AM, "Leonie Tanczer" <ltanczer01 at qub.ac.uk> wrote:
>
> > Dear All,
> >
> > I am currently looking for software to extract the whole content of a
> > website and automatically convert each page of that website to a PDF.
> >
> > I am aware of Acrobat XI Pro. However, after multiple attempts I
> > encountered the problem that it is limited to 10,000 levels (even when
> > set to extract the whole site), and the programme crashes before
> > finishing the job. As I am working with governmental websites with huge
> > amounts of content, this is not sufficient.
> >
> > I am also aware of GNU Wget, yet it only exports pages in HTML format.
> > As I would like to analyse the content in Qualitative Data Analysis
> > software, specifically MAXQDA, which does not allow the import of HTML
> > data, I am struggling here as well.
> >
> > I was wondering if anyone has conducted research with a similar
> > technique before, and whether you are aware of software that could
> > support my data collection/extraction process.
> >
> > Any advice would be greatly appreciated!
> >
> > Thank you,
> > Leonie
> >
> > ___________________________________________
> >
> > Leonie Maria Tanczer
> > PhD Candidate
> > School of Politics, International Studies & Philosophy
> > Queen's University Belfast
> > Twitter: @leotanczt
> > http://bit.ly/1d7O7kj
> > _______________________________________________
> > The Air-L at listserv.aoir.org mailing list
> > is provided by the Association of Internet Researchers http://aoir.org
> > Subscribe, change options or unsubscribe at:
> > http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> >
> > Join the Association of Internet Researchers:
> > http://www.aoir.org/
> >
>


