[Air-L] applying large-scale NLP linguistic analysis to web archives: 101 billion word nlp dataset

kalev leetaru kalev.leetaru5 at gmail.com
Fri Jan 10 12:17:51 PST 2020


For those interested in the kinds of at-scale research questions large web
archives make possible, and/or in web-scale linguistic analysis and entity
understanding: we've just released an open machine-annotated part-of-speech
dataset built by running 101 billion words of worldwide online news, drawn
from 100 million English-language articles published from 2016 to the
present, through Google's NLP API. For each token the dataset catalogs the
machine-assigned part-of-speech attributes (tag, aspect, case, form, gender,
mood, number, person, proper, reciprocity, tense and voice) and dependency
label, along with snippets of each usage. It builds upon a parallel dataset
covering the more than 11 billion entities found within those same articles.

Both datasets are available as open datasets, along with a third dataset
that applies the same entity extraction to a decade of television news from
BBC, CNN, MSNBC, and FOX, as well as the ABC, CBS, and NBC evening
newscasts, enabling online-television topical comparisons.
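The entity datasets are built with the same API's entity analysis. A rough
sketch of that call on a single snippet follows; again this is an
illustration under the same assumptions as above, not the production
pipeline, and the example sentence is invented.

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

document = language_v1.Document(
    content="The BBC interviewed the mayor of Paris on Tuesday.",  # sample
    type_=language_v1.Document.Type.PLAIN_TEXT,
    language="en",
)

# analyze_entities returns each detected entity with a type, a salience
# score, and (when available) Knowledge Graph metadata such as a MID
response = client.analyze_entities(request={"document": document})

for entity in response.entities:
    print(
        entity.name,
        entity.type_.name,               # e.g. PERSON, LOCATION, ORGANIZATION
        round(entity.salience, 3),       # prominence within the document
        entity.metadata.get("mid", ""),  # Knowledge Graph MID, if linked
    )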


PART OF SPEECH + DEPENDENCY LABELS
https://blog.gdeltproject.org/announcing-the-web-partofspeech-dataset-101-billion-words-part-of-speech-tagged-and-dependency-tree-parsed-using-googles-nlp-api/


ENTITIES
https://blog.gdeltproject.org/announcing-the-global-entity-graph-geg-and-a-new-11-billion-entity-dataset/


TV ENTITIES
https://blog.gdeltproject.org/a-deep-learning-powered-entity-graph-over-television-news-2009-2019/



Kalev


