[Air-l] Microsoft's In-House Sociologist

Wed Aug 20 16:24:09 PDT 2003

http://news.com.com/2008-1082-5065298.html

CNET News

August 19, 2003

Microsoft's In-House Sociologist

By Paul Festa, Staff Writer

Ever get the feeling your Usenet newsgroup list is being
watched? By Microsoft?

If so, consider yourself right. Thanks to the expertise of
sociologist Marc Smith, Microsoft is keeping a close eye on
newsgroups and other public e-mail lists, which it has
identified as the Internet's undervalued "knowledge
management application."

In Microsoft's research and development labs, Smith has
spent the past several years slicing and dicing data about
messages and message authors in an ambitious effort to help
people make sense of the newsgroup manifold -- the hordes of
know-it-alls, flame warriors, spammers and neophytes who, by
Smith's estimate, last year numbered more than 100 million
in the Usenet network of e-mail threads, or newsgroups.

Smith's idea is that you can tell a lot about the quality of
data by tracking its newsgroup contributors' social habits
-- a notion that holds promise for sorting through millions
of messages, and peril for an online world increasingly
skittish about invasions of privacy.

Following the launch of Microsoft's NetScan application for
analyzing newsgroups and the people who post to them, Smith
spoke to CNET News.com about NetScan, about Microsoft's
interest in e-mail lists and about an application under
development that would link objects in the real world to an
array of online information.

CNET News [CN]: How did a guy like you get to work for a
company like Microsoft?

Marc Smith [MS]: I'm a sociologist. I've now been at
Microsoft Research about four-and-a-half years. Microsoft
has a few social and cognitive psychologists, but I'm the
only sociologist.

[CN]: Which means what, exactly, in the context of
technology employment?

[MS]: A sociologist studies the attributes of relationships
and the group of relationships that add up to a collective
or a community. As a technology group, our mandate is to
both explore and to build tools to study the phenomenon that
we could call online community. We sociologists don't like
to use the term "community," particularly -- we like to
refer to them as social cyberspaces.

[CN]: What's wrong with "community"? The word seems to come
up all the time when we talk about the Internet.

[MS]: When we say "community," perhaps what we really are
looking at is a special case of a broader phenomenon that
sociologists call collective action, when a group of people
do something together. And this turns out to be the No. 1
thing people do with their computers: It's to send each
other e-mail. The No. 2 thing is to send groups of people
e-mail -- to join the list of people who like to knit, or
who like Microsoft products.

[CN]: So why exactly does Microsoft need a resident
sociologist?

[MS]: Microsoft has a big investment in online communities,
and has not had until recently many tools to enhance that
investment. What Microsoft wants around communities is what
every enterprise does, which is a peer-support,
knowledge-management application. And that means that if you
go into Usenet, you'll find 3,000 Microsoft public
newsgroups, with 1.5 million people posting 10 million
messages. And that's 2002 -- and it's going to more than
double this year, because it more than doubled in '01. We
don't see traffic flagging at all.

[CN]: My impression was that the use of e-mail lists was on
the decline.

[MS]: To the contrary! It's on the rise. Usenet alone --
which is a backwater in that most people don't know where it
is and how to find it -- on Usenet alone there were 13.1
million unique identities who used Usenet in 2002, and by
that we mean that they were a contributor and wrote at least
one message. How many people read the message? We have no
idea. That number is invisible and is fragmented over a
half-million servers that are not sharing their data. But
conservatively you could estimate that there are 10 readers
for every writer, so that makes it 130 million Usenet users
per year. And that's a small number compared to majordomo
lists, or things like Yahoo Groups, and the number of people
who have a bulletin board on things like UltimateBBS.

[CN]: What are you doing with these lists, from a
sociological standpoint?

[MS]: What we are about is the thread. It turns out that the
core sociological data type of the Internet is not IP
(Internet Protocol) numbers, or any of that stuff, it's
threaded conversations. And it's amazing how little
investment has been put into adding value to the core data
structure of the Internet, which is
the conversational thread. I can illustrate that by
suggesting that when you sit in front of your e-mail client,
simply try to sort your messages by thread size.

[CN]: And by size of the thread you mean...?

[MS]: I mean the number of messages, the number of
generations of messages, the breadth of the conversation. If
eight people reply to a message, it has a breadth of eight.
If 12 reply, it's 12. And it turns out that the frequency
distribution of thread properties is very illuminating.

It turns out that two-thirds of all threads in Usenet, in
2002, had a whopping two messages. And two-thirds of all
authors are the people who write a message, post once one
day, and never again.

[CN]: Is that indicative of a spam problem?

[MS]: No, those aren't spammers, they are the people who
post once, get their answer and go away happy. They post a
message that says they can't print, then they get their
answer. What newsgroups are is a form of knowledge
management application. What they are about is leveraging
the collective knowledge of large numbers of people.

[CN]: So how is it useful to know that people are getting
their printing questions answered? What can you do with that
information?

[MS]: What you can do is say, "Let's look at how many times
each of those unique IDs posted. Twenty-four million times?
That's your spammer." Humans have a limited capacity to type
and send and think up messages, while software is virtually
free from those constraints. What we do is say, "By looking
at these properties, the structure of authors, threads and
newsgroups, we can determine a lot of things that are good
predictors of value."

Here's an example: Let's say you have a newsgroup with
22,000 messages posted there per month. You have a problem!
What should you read? We have some suggestions. In an
existing browser, you can see the messages sorted by date,
sorted by size or sorted alphabetically, and this is not
very useful. What we want to say is, "There are different
vectors through this content space, different ways of
slicing into the data, the conversation, that are more
likely to bring valuable information."

For instance, what are people talking about? What we've done
is highlight the 40 threads that got the most number of
messages in this period -- day, week, month, year. And we'll
say, "Here are 40 really big threads." How do you know those
are good? We're not sure they were good, but these were the
things that got people really excited and engaged in this
newsgroup. That's one vector.
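
The "biggest threads in a period" view described above can be
sketched as a simple count. This is a minimal example, not
the NetScan implementation; the function name and the
(thread ID, date) input format are hypothetical.

```python
from collections import Counter
from datetime import date


def biggest_threads(posts, start, end, n=40):
    """Return the n threads with the most messages in [start, end].

    posts: iterable of (thread_id, posted_on) pairs, where
    posted_on is a datetime.date. (Assumed input format.)
    """
    counts = Counter(
        thread_id
        for thread_id, posted_on in posts
        if start <= posted_on <= end
    )
    return counts.most_common(n)


posts = [
    ("printing", date(2002, 1, 3)),
    ("printing", date(2002, 1, 4)),
    ("printing", date(2002, 1, 9)),
    ("drivers", date(2002, 1, 5)),
    ("drivers", date(2002, 1, 6)),
    ("fonts", date(2002, 2, 1)),  # outside the January window
]
print(biggest_threads(posts, date(2002, 1, 1), date(2002, 1, 31), n=2))
```

As Smith notes, a high count says only that a thread drew
attention, not that its content was good.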

[CN]: But what about the guy who gets his printer fixed in
two messages?

[MS]: And you can legitimately argue that. "What about small
threads of high value? How can you help me find them?" The
answer is that we are, by leveraging latent structural data
that is itself a product of collective behavior. You have
lots of individuals working on their own. If there were only
one person writing Web pages, Google wouldn't work. But
Google Groups doesn't do what we do to Usenet. We're doing
something useful to Usenet. We're not yet a search engine,
we're a research project. And we will eventually be doing
things related to the full text of the message.

Let's look at the individual who posts to a list. Does he
show the pattern of participation over time that is an
indicator of a valuable contributor? The question you should
raise is, "What do you mean by value?" One man's flame
warrior is another man's poet. It's not for us to tell you.
But we do give you tools to sort patterns of difference.

Let me tell you how to find someone who gives really good
technical support answers using our author tracker. It's a
way to slice a vector into the content space that measures
how dedicated are the people to this newsgroup. Basically,
it asks, "Are you a regular?"

[CN]: And what will that indicate?

[MS]: Regulars are value contributors. But you could say,
"You are sorting people by -- and we do -- how many days
they come back." For example, you go into some of our tech
support newsgroups, and you'll find that there are people
who have contributed every day in the month. OK, those are
regulars. But how do you know they have value? It's not just
the number of days you come back. There are three other
metrics, which tend to be ratios. One is the ratio of
replies: How many times did you reply to someone else, or
start a thread? Spammers may show up every day, but they
don't reply. With a very low reply-to-post ratio, I would
say that that is a person who starts a lot of conversations
but never replies to anyone else, and it's probably a
spammer. Showing up every day is not enough -- you have to
respond to other people. It's also thread-to-post. How many
threads did you touch, how many messages did you write? If
you wrote 10 times, all into one thread, that's a low ratio.
You have a high conversational concentration.
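
A rough sketch of the author measures named above: days
active, reply-to-post ratio and thread-to-post ratio. The
Post record and its field names are assumptions made for
illustration, and only the measures the interview actually
names are covered.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Post:
    author: str
    thread_id: str
    posted_on: date
    is_reply: bool  # False when the post starts a new thread


def author_metrics(posts, author):
    """Compute per-author participation measures, or None if no posts."""
    mine = [p for p in posts if p.author == author]
    if not mine:
        return None
    total = len(mine)
    return {
        # How many distinct days the author showed up.
        "days_active": len({p.posted_on for p in mine}),
        # Share of the author's posts that reply to someone else.
        "reply_to_post": sum(p.is_reply for p in mine) / total,
        # Distinct threads touched per message written.
        "thread_to_post": len({p.thread_id for p in mine}) / total,
    }


posts = [
    Post("ann", "t1", date(2002, 5, 1), False),
    Post("ann", "t1", date(2002, 5, 1), True),
    Post("ann", "t2", date(2002, 5, 2), True),
    Post("ann", "t2", date(2002, 5, 2), True),
]
print(author_metrics(posts, "ann"))
# {'days_active': 2, 'reply_to_post': 0.75, 'thread_to_post': 0.5}
```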

[CN]: Is that good or bad?

[MS]: I'm a social scientist -- I don't know the difference
between good and bad, only the difference between
difference. Do I like flame warriors? Or don't I? A high
reply-to-post indicates a flame warrior, because they tell
you you're an idiot and they put all their messages into a
few threads -- so they also have a low thread-to-post ratio.

If you want to find the answer person, flip that ratio
around. They differ from the flame warrior in the following
way: Both show up every day, and both reply. The answer
person answers a post once or twice, then moves on. We've
seen people post 500 messages in one week in one thread. If
you have that much time on your hands -- it's not to say
that it's a good thing or a bad thing, but a different
thing. We give you the opportunity to say, "I just came here
because I can't print." We will guide you to the very real
group of people who are dedicated, for whatever reason, to
not just computer technology, but answering questions about
knitting, horseback riding, dogs -- you name it. And the way
to do that is to start looking at the social accounting
metadata about authors.
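
To make the contrast concrete, here is a toy labeling rule
built on those ratios. Every threshold below is invented for
the example; the interview gives no cutoffs, and the labels
describe posting patterns rather than judge them.

```python
def describe_author(days_active, reply_to_post, thread_to_post,
                    min_days=20, low=0.3, high=0.7):
    """Return a rough pattern label.

    min_days, low and high are hypothetical illustration values,
    not figures from NetScan.
    """
    if days_active < min_days:
        return "occasional poster"
    if reply_to_post <= low:
        # Shows up regularly but almost never replies to anyone.
        return "possible spammer"
    if reply_to_post >= high and thread_to_post <= low:
        # Many replies concentrated in a few threads.
        return "possible flame warrior"
    if reply_to_post >= high and thread_to_post >= high:
        # Replies once or twice, then moves on to another thread.
        return "possible answer person"
    return "regular"


print(describe_author(30, 0.95, 0.10))  # possible flame warrior
print(describe_author(30, 0.90, 0.80))  # possible answer person
```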

[CN]: So could all of this ultimately add up to a better
search engine?

[MS]: If things go well, we'll have a better search engine.
This remains early, initial research, but our results look
promising. Reranking results based on social histories does
do a better job, and I do believe we will deliver interfaces
that will find people who are debaters, fine, but also those
who are answer people...It turns out that people have a lot
to give each other. There's a lot of knowledge to share, and
2 percent of every population is motivated to be a knowledge
sharer.

Most of us have to rely on signs or symbols that suggest a
person is reliable. With doctors you have their diplomas,
the way the office looks, and most important, who referred
you -- these are all indicators that we rely on. We are
trying to create analogous tools for online environments
where that data is latent, is not manifest in the interfaces
visibly.

[CN]: When you talk about a reputation system, I'm reminded
of the eBay system.

[MS]: We're similar but different -- eBay is an explicit
feedback system, and we are an implicit feedback system.
With eBay, buyers rate sellers, and sellers rate buyers,
after they conduct a transaction. It's what people say about
you. But there are real problems with this -- most of all
inflation, the "Beverly Hills-adjacent" problem. If you read
the L.A. real estate section, everything is "Beverly
Hills-adjacent." So there is this tendency to inflate. There
have been empirical studies of reputation ratings at eBay
that suggest that just going by reputation ratings at eBay
is not an indication that you're not going to get a
fraudulent transaction.

[CN]: Tell me about the AURA (Advanced User Resource
Annotation) project.

[MS]: AURA is about extending NetScan: "What if you could
use NetScan with a pocket computer and attach threads to
things?" We use the Toshiba e740 and a Compact Flash
bar-code reader, run AURA software, and can walk up to any
bar-coded object, any ISBN-coded object, scan it, and the
device brings back information about that object...We
imagine being able to walk up and down the aisle of a
grocery store and have a handheld computer rate everything
with a green light, a red light, a skull and crossbones.

In Hong Kong, during the height of the SARS outbreak, there
was a system that could tell you which buildings had had
confirmed SARS cases. Now that's a reputation system.

[CN]: It's easier to do this with products than with, say,
people.

[MS]: People are one thing, but objects -- all the books on
my shelves, all the food in my kitchen, the artworks in the
hallway -- we at Microsoft have bar-coded every one of them.
AURA is going to become a navigation tool. You can print a
bar code for a penny and slap them on things. Which we do --
and then Facilities comes along and scrapes them off.

[CN]: It seems that once Microsoft starts tracking the
behavior of individuals, you're asking for trouble. What
about privacy?

[MS]: I think it's a very important thing. And we have built
NetScan to protect what I think are legitimate claims for
privacy. Like a Net spider, NetScan takes publicly
accessible documents off the Internet, and it respects
metadata that says "Leave me alone!" There is the robots.txt
file that says, "You can look at this but not that." With
Usenet there is one that says "Leave my messages alone," and
we respect that. We will not store your messages if you put
that in them.
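
The interview does not name the Usenet marker. A widely used
convention is the "X-No-Archive: yes" article header, and the
sketch below assumes that is the one meant. The function name
is hypothetical, and this is not NetScan's code.

```python
from email import message_from_string


def may_store(raw_article):
    """Return False if the article asks not to be archived.

    Assumes the opt-out is an 'X-No-Archive: yes' header; header
    names are matched case-insensitively by the email module.
    """
    article = message_from_string(raw_article)
    value = article.get("X-No-Archive", "")
    return value.strip().lower() != "yes"


opted_out = "From: a@example.com\nX-No-Archive: yes\n\nPlease skip me."
ordinary = "From: b@example.com\nSubject: printing\n\nI can't print."
print(may_store(opted_out))  # False
print(may_store(ordinary))   # True
```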

[CN]: Couldn't a spammer just put that in his or her
messages, so you wouldn't be able to identify them as a
spammer?

[MS]: That's a possibility, and that's something we would
have to respect. But the system still would not fail,
because a person with no reputation is a person who has a
reputation. "Let me tell you about the people who the system
has shown to have value." We're about letting the cream
float to the top and not about letting the other stuff sink.

[CN]: How can you reassure someone who might be concerned
that it's not such a good idea for computers to be keeping
track of our belongings and our whereabouts?

[MS]: I'm not sure, but we're leaking data all over the
place now. And on the one hand, that has utility for other
people. On the other, there's a privacy risk. In some ways,
consider us a form of performance art. Would you like to see
you? This is potent. We accept that and hope we can offer
people good prophylactics against loss of privacy. And that
may mean keeping multiple IDs and e-mail addresses.
Ultimately we may have to fragment our identities.