[Air-L] Tool for collecting Instagram images/websites/data?

kalev leetaru kalev.leetaru5 at gmail.com
Sat Sep 17 11:51:34 PDT 2016


Rainer, if you're interested more in tagged imagery rather than Instagram
imagery specifically, the GDELT Visual Global Knowledge Graph (VGKG)
dataset may be of particular interest. It consists of more than 150 million
images drawn from worldwide news coverage over the last 9 months and
passed through Google's Cloud Vision API deep learning service. Each record
includes the URL of the image, the URL of the article it appeared in, a set
of tags that categorize both the objects and activities depicted in the
image, full OCR (including OCR of script and logographic languages),
identification of major worldwide commercial and NGO logos, estimation of
violence level, facial detection (but not recognition) and an estimation of
the facial sentiment of each human face, and the estimated location the
image was taken (based purely on visual analysis):

http://blog.gdeltproject.org/announcing-the-new-gdelt-visual-global-knowledge-graph-vgkg/

The full dataset is rather large, weighing in at around 850GB, and each
record is encoded as a JSON blob that can itself be quite large for highly
detailed images. The collection therefore requires a bit of expertise to
work with, but there is a great third-party R package that can process the
data into a more easily workable format
(https://github.com/abresler/gdeltr2).
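
As a rough illustration of what turning one nested JSON record into a flat,
table-friendly row could look like, here is a minimal Python sketch. The
sample record and its field names (image_url, labels, description, score)
are made up for the example and are not the actual VGKG schema, so treat
this as a starting point rather than a working loader.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # Scalar leaf: drop the trailing dot left by the parent level.
        flat[prefix.rstrip(".")] = value
    return flat


# Hypothetical record; the real VGKG field names will differ.
sample = (
    '{"image_url": "http://example.com/a.jpg",'
    ' "labels": [{"description": "crowd", "score": 0.93}]}'
)

row = flatten(json.loads(sample))
for key, value in row.items():
    print(key, value)
```

Each flattened row can then be written out with the csv module or loaded
into whatever table tool you prefer.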

Within the next few weeks, additional fields will capture all EXIF, IPTC and
XMP metadata embedded in each image (around 10% of news imagery includes
expanded metadata such as publisher-assigned keywords and textual
descriptions), and three perceptual hashes are being added (Average Hash,
Perceptual Hash and Difference Hash) to allow visual similarity comparison
and search:

http://blog.gdeltproject.org/vgkg-adds-exif-support/
http://blog.gdeltproject.org/vgkg-adds-perceptual-hashing-image-similarity-search/
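
To give a sense of how hashes like these support similarity comparison,
here is a minimal pure-Python sketch of an average hash and a difference
hash compared by Hamming distance. It assumes the image has already been
reduced to a small grayscale grid (real implementations typically shrink
the image first, for example to 8x8 pixels), the function names are
illustrative, and it is not the code GDELT uses.

```python
def average_hash(pixels):
    """Set one bit per pixel: 1 where the pixel is >= the mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def difference_hash(pixels):
    """Set one bit per horizontal pair: 1 where left is brighter than right."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(a, b):
    """Count differing bits; a smaller count means more similar images."""
    return bin(a ^ b).count("1")


# Toy 4x4 grayscale "image": a simple gradient from dark to light.
image = [[(r * 4 + c) * 10 for c in range(4)] for r in range(4)]
brighter = [[p + 10 for p in row] for row in image]   # uniform brightening
inverted = [[255 - p for p in row] for row in image]  # photographic negative

near = hamming_distance(average_hash(image), average_hash(brighter))
far = hamming_distance(average_hash(image), average_hash(inverted))
print(near, far)
```

The uniformly brightened copy should hash identically to the original
while the inverted copy should land far away, which is the property that
makes these hashes useful for near-duplicate search.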

The dataset currently updates every 15 minutes, but in the next month will
switch to updating every minute, so it may be especially useful if you're
interested in realtime visual analysis:

http://blog.gdeltproject.org/visual-gkg-to-be-first-gdelt-gen-3-release/

Finally, through a partnership with the Internet Archive, the URLs of all
images in this collection are sent to the Internet Archive each day, which
preserves each image and its corresponding article in the permanent
primary archive that powers the Wayback Machine:

http://blog.gdeltproject.org/gdelt-internet-archives-collaboration-to-archive-the-worlds-online-journalism/


Separately, if you're interested in historical imagery, you might take a look
at the Internet Archive Book Images Collection I built several years ago
with the Internet Archive, extracting the images from more than 600 million
pages of public domain books dating back 500 years from over 1,000
libraries worldwide. The image files, book-level metadata, and the text
immediately surrounding each image as it appeared on the page are all
available:

http://blogs.loc.gov/thesignal/2014/12/unlocking-the-imagery-of-500-years-of-books/
http://www.bbc.com/news/technology-28976849
http://blog.gdeltproject.org/500-years-of-the-images-of-the-worlds-books-now-on-flickr/


Hope this helps!

Kalev
http://blog.gdeltproject.org/
http://kalevleetaru.com/


On Sat, Sep 17, 2016 at 11:27 AM, Rainer Hillrichs <
hillrichs at uni-mannheim.de> wrote:

> Dear all,
>
> I searched on the list and on the web but couldn't find anything: I'm
> looking for a tool that collects Instagram images, websites, and data
> associated with a specific tag. Basically, I want to type in a tag and end
> up with a folder full of images, websites, and a table with data (e.g. user
> name, date posted, URL, other tags). I already suspect that is a lot to ask
> for ;-) Even a simpler tool would be a good start! As long as I don't have
> to end up saving individual images, websites, and typing/copying stuff
> into a table.
>
> Suggestions very much appreciated!
> Rainer
>
>
> --
> Dr. Rainer Hillrichs
> Universität Mannheim
> https://uni-mannheim.academia.edu/RainerHillrichs


