[Air-L] Government website harvesting

abram stern (aphid) aphid at ucsc.edu
Mon Jan 16 17:09:04 PST 2017


I've been scraping media (hearing videos and the associated PDFs/transcripts)
and metadata from a handful of legislative committees. This has been
complicated by the fact that they use proprietary Flash-based streaming
servers (AdobeHDS, provided by Akamai), which requires sniffing
authentication keys and reassembling many ~1 sec video fragments back into
a whole.  They also block Tor exit nodes -- ironic for the Senate Select
Committee on Intelligence...
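
For anyone curious what the reassembly step looks like, here is a minimal sketch in Python. It is not the tooling actually used for this scraping. It assumes each downloaded fragment is an unencrypted F4F file made of ISO-style boxes (a 4-byte big-endian size plus a 4-byte type) whose `mdat` payload holds raw FLV tags, and that the fragments are already fetched and in playback order. The names `iter_boxes` and `reassemble` are hypothetical, and authentication, fragment discovery, and timestamp fix-ups are left out entirely.

```python
import struct

# FLV file header: "FLV", version 1, flags 0x05 (audio + video),
# header length 9, followed by the 4-byte PreviousTagSize0 field (zero).
FLV_HEADER = b"FLV\x01\x05\x00\x00\x00\x09\x00\x00\x00\x00"


def iter_boxes(data):
    """Yield (box_type, payload) for each top-level box in one fragment."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit extended size follows the type.
            if pos + 16 > len(data):
                raise ValueError("truncated extended box header at %d" % pos)
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the data.
            size = len(data) - pos
        if size < header or pos + size > len(data):
            raise ValueError("malformed box at offset %d" % pos)
        yield box_type, data[pos + header:pos + size]
        pos += size


def reassemble(fragments):
    """Concatenate the mdat payloads of ordered fragments behind one FLV header."""
    out = bytearray(FLV_HEADER)
    for fragment in fragments:
        for box_type, payload in iter_boxes(fragment):
            if box_type == b"mdat":
                out += payload
    return bytes(out)
```

Usage would be roughly `open("hearing.flv", "wb").write(reassemble(fragment_bytes_in_order))`, where `fragment_bytes_in_order` is a hypothetical list of fragment contents. Real streams may need more care than this (duplicate sequence headers, gaps, encrypted payloads).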
-a

On Mon, Jan 16, 2017 at 4:45 PM, Ed Summers <ehs at pobox.com> wrote:

> There is also the #DataRescue effort that seems to be a loosely knit group
> of activists that includes some folks from the Internet Archive.
>
> https://envirodatagov.org/
> https://github.com/edgi-govdata-archiving/
> http://www.ppehlab.org/blogposts/2017/1/15/datarescue-philly-builds-datarefuge
>
> Apologies if it was mentioned already...
>
> //Ed
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at:
> http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>
> Join the Association of Internet Researchers:
> http://www.aoir.org/
>


