[Air-L] computing on billions of words of academic literature and the open web

kalev leetaru kalev.leetaru5 at gmail.com
Mon Sep 15 16:10:12 PDT 2014


I thought many of you on this list would be interested in our latest
paper, out today, which presents one of the first pilot large-scale
content analyses of JSTOR, DTIC, and the Internet Archive.  The hope is
that this paper will serve as a template for others, help seed and
enable a new wave of large-scale internet and literature content
analysis research, and open the door to new disciplinary applications
such as socio-cultural and area studies work.

For those interested in working with academic literature collections like
JSTOR, government document repositories like DTIC, or the open web via the
Internet Archive, the paper describes how to work with these
collections: their nuances, artifacts, and strengths; lessons learned (for
example, how to work with the Internet Archive's 1.6-billion PDF archive in
the absence of fulltext search or metadata); and general workflows.


http://dlib.org/dlib/september14/leetaru/09leetaru.html


ABSTRACT
The vast array of academic literature published by the humanities and
social sciences disciplines codifies our collective scholarly understanding
of how societies function and the beliefs, ideals, and ethnic, religious,
and tribal contexts that undergird global societal behavior, yet this
material has been largely absent from the recent computational revolution
in the study of culture. Applying temporal, geographic, thematic, and
citation algorithms to an archive of more than 21 billion words spanning
1.5 million publications from 7 collections, including the entire contents
of JSTOR, DTIC, CORE, CiteSeerX, and the Internet Archive's 1.6 billion
PDFs, academic literature is seen to offer a powerful new lens onto global
culture. Four case studies demonstrate using this archive to map the Nuer
ethnic group and identify its top experts, map the literature on food and
water security, explore the thematic underpinnings of the Rwandan genocide,
and construct a network over the ethnic groups of the world as seen through
the combined academic literature of the past half century.



Kalev Leetaru
2013-2014 Yahoo! Fellow, Georgetown University
http://kalevleetaru.com/


