[Air-L] urlte.am to back up URL shorteners

gus andrews gus.andrews at gmail.com
Wed Jun 8 16:12:58 PDT 2011


So here's a project which should be tremendously useful to Internet
researchers, just announced by Jason Scott, a "rogue archivist activist":
URLTE.AM. The project will apparently be making backups of every URL coded
for by a URL shortener, thereby keeping track of when links go dead or expire.
Between this and archive.org, a lot less Internet history should be lost.

From the description at urlte.am:

"Welcome to the URLTeam website. The URLTeam is the ArchiveTeam subcommittee
on URL shorteners. We believe that they pose a serious threat to the
internet's integrity. If one of them dies, gets hacked or sells out,
millions of links will stop working. Thus we preemptively release backups,
because URL shorteners are too busy to make backups themselves.

Releases
Every 6 months or so we release a torrent of all backed up files. When a new
torrent is released, you can simply delete the old torrent and download the
new files to the same location. Your BitTorrent client will figure out which
files have changed and will redownload them, even if you weren't finished
yet on the previous torrent.

The latest torrent was released on May 31st, 2011: urlteam.torrent - List of
files - Readme

The next release is planned around December 2011.

Data format
All data files in the torrent are simple text files compressed using
LZMA2/xz. The text file format is very simple: Each line contains one
mapping in the following format: Shortcode, pipe (ASCII 0x7C), long URL,
line feed (ASCII 0x0A). The file is sorted by shortcodes using the following
order:

- Shorter shortcodes come before longer ones
- Decimal digits
- Lowercase letters
- Uppercase letters
- ASCII value

Depending on the URL shortener there might be multiple long URLs for one
shortcode.
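
To make the format description concrete, here is a minimal Python sketch of reading one of the data files. It is not one of the URLTeam tools; the function names, the UTF-8 decoding, and the reading of the sort order (length first, then character by character with digits before lowercase before uppercase before everything else by ASCII value) are assumptions made for illustration.

```python
import lzma
from collections import defaultdict


def parse_line(line):
    """Split one 'shortcode|long URL' line into its two parts."""
    # Split on the first pipe only: a long URL may itself contain '|'.
    shortcode, long_url = line.rstrip("\n").split("|", 1)
    return shortcode, long_url


def char_rank(ch):
    """Rank a character: digits, then lowercase, then uppercase, then the rest."""
    if "0" <= ch <= "9":
        group = 0
    elif "a" <= ch <= "z":
        group = 1
    elif "A" <= ch <= "Z":
        group = 2
    else:
        group = 3
    # Within a group, fall back to the character's code point (ASCII value).
    return (group, ord(ch))


def sort_key(shortcode):
    """Assumed ordering: shorter shortcodes first, then per-character rank."""
    return (len(shortcode), [char_rank(ch) for ch in shortcode])


def load_mappings(path, encoding="utf-8"):
    """Read an xz-compressed data file into {shortcode: [long URLs]}.

    The encoding is an assumption; undecodable bytes are replaced.
    """
    mappings = defaultdict(list)
    with lzma.open(path, "rt", encoding=encoding, errors="replace",
                   newline="\n") as handle:
        for line in handle:
            if "|" not in line:
                continue  # skip blank or malformed lines
            shortcode, long_url = parse_line(line)
            # A shortcode may map to several long URLs, so keep a list.
            mappings[shortcode].append(long_url)
    return mappings
```

Calling load_mappings on a decompressible .xz file from the torrent should give a dictionary keyed by shortcode, and sorted(mappings, key=sort_key) should reproduce the order described above if the assumed reading of the rules is right.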

There are some tools for working with the files on GitHub.

Q&A
Can you do a backup of shortener XY please?
Maybe. Some shorteners are very fast at banning scrapers which makes it
impossible to do a backup in an efficient way. Contact us and we will look
into it.

What about 301Works.org? They help URL shorteners with backups.
Unfortunately they rely on the cooperation of URL shorteners, and many of
the biggest URL shorteners refuse to cooperate. Furthermore, they don't plan
on releasing any data files to the public. We do however greatly value their
work and when selecting which URL shorteners we will scrape next we
concentrate especially on those that don't cooperate with 301Works.

Since March 2011 we have been actively uploading data from non-cooperating
shorteners to the 301Works archive. While those files are not available for
download (they contain the same data as our torrents anyway) you can watch
our progress here.

I like what you do. Can I help?
Sure thing. We can always use people who help with scraping. Or programmers.
We could also use a fast server with lots of space for storing the data and
seeding the torrents. Anyhow, if you want to help, please contact us.

What's with all those weird directory and file names in the torrent?
It's hard to organize that stuff into files so that each individual file is
only a few hundred megabytes in size. Because we want to accommodate people
with inferior operating systems we also need to assume a case-insensitive
file system."

Gillian "Gus" Andrews
Postdoctoral Fellow, Google Academic Research Grant
Anthropology in Education
Teachers College, Columbia University
http://gandre.ws


