[Air-L] new open "diff" catalog of global online news changes
kalev leetaru
kalev.leetaru5 at gmail.com
Tue Aug 28 06:22:29 PDT 2018
For those interested in measuring the fluidity of the online sphere, and in
particular how much online news coverage changes over time (from 404s to
redirects to title and text edits), this morning we released the new GDELT
Global Difference Graph (GDG), which recrawls every article monitored by
GDELT after 24 hours and again after one week and catalogs all of the
changes it observes. Text changes cover only the article text itself, not
the surrounding page. Changes are reported at the word level for
space-delimited languages and at the character level for others (currently
Burmese, Chinese, Dzongkha, Japanese, Khmer, Lao, Thai, Tibetan and
Vietnamese, with more being added shortly).
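
To make the word-level versus character-level distinction concrete, here is
a minimal Python sketch. It is not GDELT's implementation: it assumes
Python's standard difflib as the diff engine, and the names
CHAR_LEVEL_LANGS, tokenize and diff_tokens, along with the use of ISO 639-1
codes to select the tokenization, are hypothetical choices made for
illustration.

```python
import difflib

# Assumption: ISO 639-1 codes for the languages the announcement lists as
# diffed per character (Burmese, Chinese, Dzongkha, Japanese, Khmer, Lao,
# Thai, Tibetan, Vietnamese). The set name is hypothetical.
CHAR_LEVEL_LANGS = {"my", "zh", "dz", "ja", "km", "lo", "th", "bo", "vi"}


def tokenize(text, lang):
    """Split into characters for listed languages, else into whitespace-delimited words."""
    return list(text) if lang in CHAR_LEVEL_LANGS else text.split()


def diff_tokens(old, new, lang):
    """Return (operation, old_tokens, new_tokens) for every changed span."""
    a, b = tokenize(old, lang), tokenize(new, lang)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Word-level change in a space-delimited language.
print(diff_tokens("The mayor resigned on Monday",
                  "The mayor resigned on Tuesday", "en"))
# Character-level change in a language without word spacing.
print(diff_tokens("东京", "东京都", "zh"))
```

The first call reports a single replaced word and the second a single
inserted character, which is the granularity described above.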
We're particularly excited about the ability to assess change globally
across countries and languages and at scale, across everything GDELT
monitors each day.
The resulting global change log is entirely open data, available in
one-minute updates as JSON files, a BigQuery table and an RSS feed for web
archives (allowing them to recrawl changed pages).
This is an alpha-grade release, so you will undoubtedly find some rough
edges, but we're incredibly excited to see what people are able to do with
it!
https://blog.gdeltproject.org/announcing-the-gdelt-global-difference-graph-gdg-planetary-scale-change-detection-for-the-global-news-media/
You can also couple this with our global frontpage outlink monitoring (35
billion outlinks to 240 million unique URLs to date) to assess what
percentage of homepage links are edited over time:
https://blog.gdeltproject.org/announcing-gdelt-global-frontpage-graph-gfg/
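
As a rough sketch of that coupling, the percentage can be computed as a set
overlap between front-page outlinks and URLs seen in the change log. The
function name percent_edited and the example URLs are hypothetical, and the
sketch assumes the URLs have already been extracted from both datasets; it
makes no claim about their actual file layouts.

```python
def percent_edited(frontpage_urls, changed_urls):
    """Percentage of distinct front-page outlinks that also appear in the change log."""
    links = set(frontpage_urls)
    if not links:
        return 0.0  # avoid dividing by zero when no outlinks were observed
    return 100.0 * len(links & set(changed_urls)) / len(links)


# Hypothetical URLs standing in for values pulled from the two datasets.
outlinks = ["https://example.com/a", "https://example.com/b",
            "https://example.com/c", "https://example.com/d"]
changed = ["https://example.com/b", "https://example.com/d",
           "https://example.com/z"]
print(percent_edited(outlinks, changed))  # 50.0
```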
Kalev