[Air-L] Ethics of using hacked data.

Wed Oct 7 13:11:31 PDT 2015

Hello list-

I recently got into a discussion with a colleague about the ethics of using
hacked data, specifically the Patreon hacked data (see here:
http://arstechnica.com/security/2015/10/gigabytes-of-user-data-from-hack-of-patreon-donations-site-dumped-online/
).

He and I do crowdfunding work, and had wanted to look at Patreon, but as
far as I can tell they have no easy hook into all their projects (for
scraping), so to me this data hack was like a gift! But he said there was
no way we could use it. We aren't doing sentiment analysis or anything; we
would use aggregated measures like funding levels and then report things
like means and maybe a regression, so there would be no identifiable
information whatsoever derived from the hacked data in any of our resulting
work (though we might go to the site itself and pull some quotes).

I looked at the AoIR ethics guidelines ( http://aoir.org/reports/ethics2.pdf
), and didn't see anything specifically about hacked data (I don't think
"hacked" is the best word, but I don't like "stolen" either, but those are
different discussions).

One relevant line I noticed was this one:
"If access to an online context is publicly available, do
members/participants/authors perceive the context to be public?" (p. 8)
The problem with the data is that it covers the entire website: some of it
was private and some was public, but now it's all public and everyone knows
it's public.

I agree that a lot of the data in the data dump had been intended to be
private -- apparently, direct messages are in there -- but we wouldn't use
that data (it's not something we're interested in). We'd use data like
number of funders and funding levels and then aggregate everything. I see
that some of it was meant to be private, but given that the entire site was
hacked and exported, I don't see how anyone could currently have an
expectation of privacy. I'm not trying to torture the definition; it's just
that it was private until it wasn't.

I can see that some academic researchers -- at least those in computer
security -- would be interested in this data and should be able to publish
in peer reviewed journals about it, in an anonymized manner (probably as an
example of "here's a data hack like what we are talking about, here's what
hackers released").

I also think that probably every script kiddie has downloaded the data, as
has every grey and black market email list spammer, and probably every
botnet purveyor (for passwords) and maybe even the hacking arm of the
Chinese army and the NSA. My point here is that if we were to use the data
in academic research we wouldn't be publicizing it to nefarious people who
would misuse it, since all of those people already have it. We could maybe
even help people who want to use crowdfunding (hopefully!) if we have some
results. (I guess I don't see that we would be doing any harm by using it.)

So, what do people think? Did I miss something in the AoIR guidelines? I
realize it isn't clear either way, or I wouldn't be asking, so the answers
will probably point to this as a grey area (so I'm not sure why I even
ask).

But I'm not looking for "You can't use it because it's hacked," because I
don't think that explains anything. I could counter that with "It is
publicly available found data," because it is, although I don't think
that's the best reply either. Both lack nuance.

-Nat

-- 
Nathaniel Poor, Ph.D.
http://natpoor.blogspot.com