[Air-L] Seeking solutions to small text search mystery

Cornelius Puschmann cornelius.puschmann at uni-duesseldorf.de
Tue Jan 11 00:53:29 PST 2011


Another possible solution (depending on which operating system you are on) is
pdftotext (http://linux.die.net/man/1/pdftotext) in combination with grep
(on Linux or OS X): convert each PDF to plain text, then search the text
output. This makes more sense if you're planning to script the process; when
working manually, the tools already mentioned are probably more comfortable
to use.
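
If you do go the scripting route, here is a rough sketch of what it could
look like in Python. It assumes pdftotext is installed and on your PATH; the
function names are made up for illustration, and whether the symbols survive
the text conversion will depend on how each PDF encodes them, so treat it as
a starting point rather than a guaranteed fix.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Extract text from a PDF by calling the external pdftotext tool.

    Assumes pdftotext is on the PATH. The trailing "-" asks pdftotext
    to write to standard output instead of a .txt file.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def find_literal(text, needle):
    """Return (line_number, line) pairs whose line contains needle.

    This is a plain substring match, roughly what `grep -n -F` does,
    so characters such as < or = are not treated as special.
    """
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line
    ]
```

You would then call something like find_literal(pdf_to_text("article.pdf"),
"p <"), where article.pdf stands in for one of your own files, and loop over
a folder of PDFs as needed.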

Best,

Cornelius

On Mon, Jan 10, 2011 at 9:24 PM, Craig Scott <crscott at rutgers.edu> wrote:

>
>
> Colleagues, a graduate student and I could use your help solving a mystery
> related to computerized text searching/coding of online documents.  We are
> examining documents (all saved as .pdf files) using the advanced search tool
> in Adobe Reader. While that tool generally works fine, it does not seem to
> recognize certain fairly standard statistical/mathematical symbols (such as
> the p used in statistical significance testing and symbols such as <, >, or
> =) in numerous documents.  This is true even when we directly cut and paste
> the symbol in question into the search tool (surprisingly, it still does not
> recognize that symbol in the document). The problem occurs only with certain
> sources (such as all articles from certain journals), even when the rest of
> the article is fully searchable. This is happening with very recent
> documents published after 2000 (we are not searching older ones). We suspect
> these symbols might be part of some equation editor or specially formatted
> text, but we don't know.
>
> Has anyone else encountered and solved a similar problem? Do you have any
> other suggestions on a search tool for .pdf documents that might be
> superior? We would also welcome any suggestions on other ways to save these
> documents and search them that would address this (I think we could do
> optical character recognition, but fear that may create other accuracy
> problems). Thanks for any suggestions/thoughts you have related to helping
> us solve this frustrating little mystery.
>
> Craig
>
> Craig R. Scott, Ph.D.,
>
>   Associate Professor, Department of Communication &
>
>   Director, Ph.D. Program
>
> School of Communication & Information
>
> Rutgers University
>
> 4 Huntington Street, New Brunswick, NJ 08901
>
> Voice: 732-932-7500 x8142; Fax: 732-932-3756
>
> Office in 201 DeWitt (185 College Avenue)
>
> Web: http://comminfo.rutgers.edu/directory/crscott/index.html
>
> Linked in: http://www.linkedin.com/pub/11/b83/241
>



-- 
Dr. Cornelius Puschmann, M.A.

Department for English Language and Linguistics
Heinrich-Heine-Universität Düsseldorf
Building 23.11, Level 1, Room 21
Universitätsstrasse 1
40225 Düsseldorf
Germany

+49 211 81 15927 (office)

Nachwuchsforschergruppe "Wissenschaft und Internet" /
Junior Researchers Group "Science and the Internet"
http://nfgwin.uni-duesseldorf.de


