[Air-L] PDF is proprietary?

Eric Mill konklone at gmail.com
Sat Jan 19 09:31:18 PST 2013


There's a big difference between searchable and machine readable.

For example, one set of PDFs I've worked with pretty extensively is the
House of Representatives' Statement of Disbursements
<http://disbursements.house.gov/> (how the House spends its money). The
House releases these PDFs in a fully searchable form: they're not
images, and they contain all the text displayed in the PDF.

But what they're releasing is really a database - it's expenses! - and if
you want to do any sort of basic analysis
<http://sunlightfoundation.com/blog/2012/02/06/turnover-in-the-house/>
(like summing numbers together), you need more than a searchable PDF. A
couple of coworkers and I have figured out a Python script
<https://github.com/sunlightlabs/disbursements/blob/master/process_new_release/1_parse_disbursements/parse-disbursements.py>
that does a pretty good job of generating a CSV (spreadsheet) from the
PDF, and so my organization, the Sunlight Foundation, has published
these CSVs <http://sunlightfoundation.com/projects/expenditures/> as a
public service for a few years.
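
To make the difference concrete, here is a minimal sketch of the kind of
analysis that becomes trivial once the data is in a CSV. The column
names and sample rows are invented for illustration; they are not taken
from the published files.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample data: column names and values are invented.
SAMPLE_CSV = """office,category,amount
OFFICE A,TRAVEL,"1,250.00"
OFFICE A,SUPPLIES,310.45
OFFICE B,TRAVEL,89.10
"""


def total_by_office(csv_text):
    """Sum the 'amount' column for each value of the 'office' column."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Amounts may carry thousands separators, so strip them first.
        totals[row["office"]] += Decimal(row["amount"].replace(",", ""))
    return dict(totals)


totals = total_by_office(SAMPLE_CSV)
for office, total in sorted(totals.items()):
    print(office, total)
```

With a searchable PDF, none of this works until someone has first turned
the page text back into rows and columns.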

That Python script may look small, but it's quite specific and brittle
<https://github.com/sunlightlabs/disbursements/pull/1>, it's the result
of many hours of collective work, and I cross my fingers every quarter
that the House doesn't change a single thing. We're very lucky that the
original PDF is neatly tabular, with one entry per row. The Senate, on
the other hand, started publishing
<http://www.senate.gov/legislative/common/generic/report_secsen.htm>
similarly searchable PDFs at the end of 2011, but because individual
expenditures there span more than one row, writing a parser is much
harder
<http://sunlightfoundation.com/blog/2011/11/30/senate-finally-publishes-its-spending-online-but-could-do-much-better/>,
and that has so far dissuaded us from trying.
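
A rough sketch of why the row layout matters so much. This is not the
script linked above, and the line layout here is invented (free text
followed by a trailing dollar amount); it only contrasts the two cases.
When every entry sits on one line, each line can be parsed on its own.
When entries wrap, the parser has to carry state between lines and guess
where one entry ends.

```python
import re

# Invented layout: description text followed by a trailing dollar amount.
ROW = re.compile(r"^(?P<desc>.+?)\s+(?P<amount>[\d,]+\.\d{2})$")


def parse_single_row(lines):
    """One entry per line: every line is parsed independently."""
    entries = []
    for line in lines:
        m = ROW.match(line.strip())
        if m:
            entries.append((m.group("desc"), m.group("amount")))
    return entries


def parse_multi_row(lines):
    """Entries may wrap: buffer text until a line ends with an amount.

    This is a naive heuristic. It breaks as soon as a continuation line
    happens to end in something that looks like an amount.
    """
    entries, pending = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        m = ROW.match(line)
        if m:
            pending.append(m.group("desc"))
            entries.append((" ".join(pending), m.group("amount")))
            pending = []
        else:
            pending.append(line)
    return entries


single = ["OFFICE SUPPLIES 310.45", "TRAVEL 1,250.00"]
wrapped = ["TRAVEL REIMBURSEMENT FOR", "STAFF 1,250.00", "OFFICE SUPPLIES 310.45"]

print(parse_single_row(single))
print(parse_single_row(wrapped))  # silently loses part of a description
print(parse_multi_row(wrapped))
```

Note that the per-line parser does not fail on the wrapped input; it
quietly returns a truncated description, which is the kind of error that
is easy to miss in a spending dataset.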

PDFs are often quite suitable for documents, and most these days are
searchable, but they are not machine readable.


On Sat, Jan 19, 2013 at 9:25 AM, Marianne van den Boomen <
M.V.T.vandenBoomen at uu.nl> wrote:

>
>
> On 15/01/2013 16:28, Burcu Bakioglu wrote:
>
>> Also as danah was saying, PDF is not a machine readable format, so search
>> engines can only index by the title, not the content of the article. So
>> those who are searching it won't readily find it.
>>
>
> While this was true in the 90s, most PDFs are now not just images but
> parsed by OCR and therefore as full text indexable by search engines. Try
> this on Google with any obscure pdf you have uploaded - it will pop up.
>
> kind regards
>
> Marianne van den Boomen
>
>
>
>
> Media and Culture Studies | University Utrecht
> Office: Kromme Nieuwegracht 20 (room T2.13A)
> Mail: Muntstraat 2a | 3512 EV UTRECHT
> Phone: +31 (0)30 253 9607
> M.V.T.vandenBoomen at uu.nl | www.hum.uu.nl
> www.newmediastudies.nl | www.vandenboomen.org


