[Air-L] Ethics of using hacked data.
bendert.zevenbergen at oii.ox.ac.uk
Mon Oct 19 03:38:47 PDT 2015
Interesting question! Apologies for this very late answer (due to travel
to tech ethics workshops!).
I will attempt an answer, but take into account my background as an
information lawyer and my project on the ethics of networked systems
research (so more about Internet architecture experimentation, less about
the use of content: http://ensr.oii.ox.ac.uk). I will not try to give
formal legal advice here, just some things to consider. There are many
subtleties that I won’t go into. The most important point to consider is
**Just because data is available, doesn’t mean you’re not violating rights
or ethical principles by using it (research or otherwise). In other words:
Just because you can, doesn’t mean you should.**
Some analogies: even though you can tape music from the radio and sell the
tapes at a local market, that doesn't mean you're not violating copyright.
Yes, you can collect WiFi signal data while taking pictures of every
corner of the Earth, but as Google learned, that doesn't make it legal. In
short, information isn't always as free as it technically appears to be.
(1) Privacy laws, data protection frameworks, and ethical principles
likely do apply to you in this case. None of these people whose data was
breached consented to be part of your research. It’s not clear to me
exactly which data you’d want to use to create aggregated measures, but
just be aware that processing any information that is linked to an
identifiable person would likely constitute a breach of relevant
privacy/data protection laws. By identifiable I mean identifiable by
anyone, including those with more computing capacity or resources than you
(difficult to say where to draw the line in this hypothetical assessment).
(2) Your statement "I don't see how currently anyone could have an
expectation of privacy any more" doesn't hold. These people had an
expectation of privacy at the time of communicating this data, or
interacting with the website. They did not intend for their data to be
used for academic research, or any other processing outside of the context
of the service they trusted. You’d be changing the context and the
audience of this information. Besides the legal issue addressed in the
previous paragraph, you should take into account the potential harm of
using the information in the new context and for the new audience you'd be
creating (including the dissemination of your papers and datasets).
Even the fact that it’s now leaked doesn’t change this per se. Maybe the
audience is now script kiddies, spammers, and intelligence agencies, but
you’d be adding even more unintended audiences to this.
So if you use data that can somehow be used to identify someone, you're
likely breaching laws. This is less so if you're only using aggregated
data, but I wonder how you'd construct such aggregates without processing
personal data in the first place.
Further, even if it turns out you've found a way to not violate laws,
which is possible, you’re still in an ethical grey zone:
(3) It could be argued that by using this data, you're (implicitly)
condoning the act of hacking and publishing this data. Stronger still: if
you profit from using information from this breach (publications and other
career enhancements), you could entice others to also work with leaked
data, therefore potentially incentivising (and even justifying) hackers
for their acts ("for science!"). These statements may be a bit far-fetched
for some, but there is some value in thinking about this. You'd be setting
a precedent for working with hacked data that may be difficult to reverse.
I appreciate that huge data sets like this one are a social scientist’s
dream come true, but better ways must be found to access them.
Although this is up for debate, it seems to me that academics are
perceived to have a different ethical framework than activists or
journalists. They, too, need to take into account ethics, but their
purposes and perceived benefits differ, so the whole weighing and
justification process is different.
In a recently published extended workshop report of the above-named
project, we discuss some cases that are similar to yours. We don’t discuss
hacked data directly, but some of the considerations and lessons drawn
from this report will be useful for your thinking about this. Find it
Also, feel free to present this case to a panel for Networking and
Security that we’re running for a more detailed response (I can’t cover it
all here): https://www.ethicalresearch.org/efp/netsec/
Finally, I'd be interested in writing up this case study with you
for a particular venue, but we can discuss that between us.
DPhil (PhD) Candidate
Oxford Internet Institute
University of Oxford
Senior Fellow Open Technology Fund
On 07/10/2015 16:55, "air-l-request at listserv.aoir.org"
<air-l-request at listserv.aoir.org> wrote:
>Date: Wed, 7 Oct 2015 16:11:31 -0400
>From: Nathaniel Poor <natpoor at gmail.com>
>To: AOIR <air-l at listserv.aoir.org>
>Subject: [Air-L] Ethics of using hacked data.
>I recently got into a discussion with a colleague about the ethics of
>hacked data, specifically the Patreon hacked data (see here:
>He and I do crowdfunding work, and had wanted to look at Patreon, but as
>far as I can tell they have no easy hook into all their projects (for
>scraping), so, to me this data hack was like a gift! But he said there was
>no way we could use it. We aren't doing sentiment analysis or anything, we
>would use aggregated measures like funding levels and then report things
>like means and maybe a regression, so there would be no identifiable
>information whatsoever derived from the hacked data in any of our
>work (we might go to the site and pull some quotes).
>I looked at the AoIR ethics guidelines (
>), and didn't see anything specifically about hacked data (I don't think
>"hacked" is the best word, but I don't like "stolen" either; those are
>the options we have).
>One relevant line I noticed was this one:
>"If access to an online context is publicly available, do
>members/participants/authors perceive the context to be public?" (p. 8)
>So, the problem with the data is that it's the entire website, so some was
>private and some was public, but now it's all public and everyone knows
>it.
>To me, I agree that a lot of the data in the data-dump had been intended
>to be private -- apparently, direct messages are in there -- but we wouldn't
>use that data (it's not something we're interested in). We'd use data like
>number of funders and funding levels and then aggregate everything. I see
>that some of it was meant to be private, but given the entire site was
>hacked and exported I don't see how currently anyone could have an
>expectation of privacy any more. I'm not trying to torture the definition,
>it's just that it was private until it wasn't.
>I can see that some academic researchers -- at least those in computer
>security -- would be interested in this data and should be able to publish
>in peer-reviewed journals about it, in an anonymized manner (probably as
>an example of "here's a data hack like what we are talking about, here's
>what happened").
>I also think that probably every script kiddie has downloaded the data, as
>has every grey and black market email list spammer, and probably every
>botnet purveyor (for passwords) and maybe even the hacking arm of the
>Chinese army and the NSA. My point here is that if we were to use the data
>in academic research we wouldn't be publicizing it to nefarious people who
>would misuse it since all of those people already have it. We could maybe
>help people who want to use crowdfunding some (hopefully!) if we have some
>results. (I guess I don't see that we would be doing any harm by using
>it.)
>So, what do people think? Did I miss something in the AoIR guidelines? I
>realize it's not clear either way, or I wouldn't be asking, so probably
>the answers will point to this as a grey area (so why do I even ask, I am
>not sure).
>But I'm not looking for "You can't use it because it's hacked," because I
>don't think that explains anything. I could counter that with "It is
>publicly available found data," because it is, although I don't think
>that's the best reply either. Both lack nuance.
>Nathaniel Poor, Ph.D.