[Air-l] ensuring full-text pdfs [was Re: Citation Managers - Alternatives to Endnote/CiteULike/... ?]

Sun Mar 19 11:06:59 PST 2006

On Mar 19, 2006, at 1:02 PM, Jeremy Hunsinger wrote:

> <snip very helpful workflow description>

> I download or print every document to pdf that i deem appropriate,  
> currently around 11000 different pdfs, I
> make sure that they are all full text pdfs ...

I too strongly prefer full-text PDFs, and I'd love to know how you
ensure this.

I've written to our e-journal subscriptions person at the library
explaining their importance and requesting that the library prefer
full-text PDF providers when purchasing e-journal subscriptions.
Unfortunately this crucial bit of metadata doesn't appear to make its
way into library catalogues (or services like OpenURL lookups), so I
find it hard to remember which providers have full-text searchable
PDFs while grabbing research papers online.  This means I'm often
annoyed later to find image PDFs (yes, I'm looking at you, JSTOR) in
my personal library.
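
One rough way to spot image-only PDFs already sitting in a personal
library is sketched below in Python. It is only a heuristic, not a
PDF parser: it assumes that a file with a text layer mentions a /Font
resource somewhere in its raw bytes, and that a scanned file mentions
only image objects. The function names are made up for illustration.
PDFs that store their objects in compressed streams may come back as
"unknown", and a scanned article with a typeset cover page would be
reported as "text".

```python
import re
from pathlib import Path

# Heuristic markers searched for in the raw, undecoded file bytes.
# Assumption: selectable text implies a /Font resource; scans carry
# only image XObjects. Compressed object streams can hide both.
FONT_MARKER = re.compile(rb"/Font\b")
IMAGE_MARKER = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Return 'text', 'image-only', or 'unknown' for raw PDF bytes."""
    if FONT_MARKER.search(data):
        return "text"
    if IMAGE_MARKER.search(data):
        return "image-only"
    return "unknown"


def find_suspect_pdfs(folder):
    """Yield (path, verdict) for each PDF not classified as 'text'."""
    for path in sorted(Path(folder).rglob("*.pdf")):
        verdict = classify_pdf_bytes(path.read_bytes())
        if verdict != "text":
            yield path, verdict
```

Looping over find_suspect_pdfs("Papers") (a hypothetical folder name)
and printing each pair gives a list of files worth checking by hand or
running through OCR.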

To be clear, I'm referring to online journal providers. If one is
printing HTML or other full-text documents (i.e. creating one's own
PDF records), then full text is fairly easy to ensure: it's the
default mode when printing a PDF to disk on OS X, and I assume that
Acrobat has the same capability on Windows.

Any hints?  Are you using OCR to convert the image PDFs?  Is that  
pretty effective?  Does it integrate with DEVONthink?

Perhaps we should create a list of full-text PDF providers on a wiki  
somewhere?  Does anyone in the e-journal purchasing world already  
prefer full-text PDF providers?  Or maybe end-user OCR is sufficient?

Thanks,
James

PS: Apologies for the thread-jack. FWIW I use BibDesk, which is
great for BibTeX/LaTeX integration and searching inside PDFs, but
lacking in Word integration and Windows-ness.  Quotes go into the
annotation field linked to the record.  http://bibdesk.sf.net