[Air-L] Sampling strategies for classification tasks

Shulman, Stu stu at texifter.com
Wed Apr 29 04:18:56 PDT 2020


Sina,

You face many key choices, not all of them algorithmic:

- how many tasks?
- which are the easy ones versus which are hard?
- what is the best order of tasks?
- why 4 categories?
- will you make categories mutually exclusive or not?
- how will you resolve boundary cases?
- is there enough unlabeled and labeled data to build a balanced model?
- can you create original, corpus-based labels, yourself or in a
small group?
- do labels created by others, at another time, for another reason, using a
different corpus, work?
- how will you validate that the labels, or the model they produce, are
accurate?
- how do you know a good annotator/evaluator from a poor one?
- how does the task-specific aptitude of the annotators impact the
training data and the model?

Let me explain my thinking on these ideas, which reflects more than 30
years of labeling: it started with a pen-and-paper undergraduate senior
thesis in 1988, included 10 years of using and teaching NUD*IST, NVivo, and
Atlas.ti, and continued through my own software development journey to roll
everything I know into two labeling platforms.

The number of tasks is the most important choice: one task covering all
categories, or one or more tasks per category? I always favor the latter.
To the
extent you decompose problems into separate tasks, the chance you will get
them done consistently and accurately improves. By breaking the tasks down,
you very quickly (sometimes in minutes) learn what is easy, what is
difficult, and start to think about the best order of tasks, which with
Twitter is often as follows: collect, deduplicate, sample seeds and
singles, scope for relevance issues, and start your first binary
classification, which for me is always relevance. For example, one common
method sequence is to build a relevance classifier first, then a main topic
classifier within the relevant data, then a sentiment classifier within the
relevant, on-topic subset. The end result is much better than going from
raw data directly to four categories with a pre-wrapped model. On Twitter,
duplicates are retweets (RTs), and you really never want to label the same
item over and over, which is why deduplication is a prerequisite to get
going.
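
To make that sequence concrete, here is a minimal sketch (not my actual
tooling) of what the decomposition can look like in code, using pandas and
scikit-learn as stand-ins. The column names ("text", the 0/1 labels) and
the exact cascade order are assumptions you would adapt to your own corpus:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def deduplicate(tweets: pd.DataFrame) -> pd.DataFrame:
        """Drop retweets and exact duplicate texts so no item gets labeled twice.
        Assumes a string column named "text"."""
        originals = tweets[~tweets["text"].str.startswith("RT @")]
        return originals.drop_duplicates(subset="text")

    def train_binary(texts, labels):
        """One simple binary classifier for one decomposed task."""
        model = make_pipeline(TfidfVectorizer(min_df=2),
                              LogisticRegression(max_iter=1000))
        model.fit(texts, labels)
        return model

    def cascade(corpus, relevance_clf, topic_clf, sentiment_clf):
        """Relevance first, then topic within the relevant data,
        then sentiment within the relevant, on-topic subset."""
        corpus = deduplicate(corpus)
        relevant = corpus[relevance_clf.predict(corpus["text"]) == 1]
        on_topic = relevant[topic_clf.predict(relevant["text"]) == 1]
        return on_topic.assign(sentiment=sentiment_clf.predict(on_topic["text"]))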

Exclusivity in categories is among the most important decisions. I always
try to build mutually exclusive categories, in layers as needed, to keep
the training data as focused as possible on the problem at hand. Even
adding a third category and making the codes non-mutually exclusive will
complicate the human labeling, measurement, and the signal produced in a
machine-learning model. Exclusive categories produce better results in
nearly all machine-learning we do. If I care about three topics, I build
three classifiers: A-Not A, B-Not B, and C-Not C. Each is a separate task,
easier for the humans and the machines. You can mash up the results
afterward (see the sketch below).
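
Here is a rough sketch of that mashup step, again with scikit-learn as a
stand-in: one binary A-Not A model per topic, then pick the
highest-probability topic for each item. The names and the argmax rule are
illustrative assumptions; if your categories truly are not mutually
exclusive, you would simply keep each binary prediction on its own:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_per_topic(texts, topic_labels):
        """topic_labels maps a topic name to a 0/1 label list (A / Not A)."""
        models = {}
        for topic, labels in topic_labels.items():
            clf = make_pipeline(TfidfVectorizer(min_df=2),
                                LogisticRegression(max_iter=1000))
            clf.fit(texts, labels)
            models[topic] = clf
        return models

    def mashup(models, texts):
        """Combine the separate binary models into one label per item."""
        topics = list(models)
        # Probability of the positive class from each A / Not A model.
        probs = np.column_stack(
            [models[t].predict_proba(texts)[:, 1] for t in topics])
        return [topics[i] for i in probs.argmax(axis=1)]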

Boundary cases are the hard part in annotation and machine-learning. Some
are irresolvable. Others you can plan around or learn your way through if
you label in short stints of small groups who all write reflection memos in
a shared Google form after each 5-15 minute labeling session. The best way
to limit the impact of irresolvable boundary cases is to decompose the
problems as noted above. When multiple annotators disagree on a boundary
case, someone in the end has to decide who is right. We call that process
CoderRank, because it reveals over time that people are never equally able
to understand and execute a classification task. The more people you add,
the clearer this fundamental fact becomes. Coders are not equal; some are
terrible and a small number are truly legendary. You need to know where you
and your annotators sit on the spectrum.
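
CoderRank itself is our process at Texifter, so take this only as a rough
stand-in for the underlying idea: score each annotator by how often their
label matches the adjudicated decision, then rank them. The data shapes
here are assumptions:

    from collections import defaultdict

    def rank_annotators(annotations, adjudicated):
        """annotations: iterable of (annotator, item_id, label) triples.
        adjudicated: dict of item_id -> the label the adjudicator decided
        was right."""
        agree, total = defaultdict(int), defaultdict(int)
        for annotator, item_id, label in annotations:
            if item_id in adjudicated:
                total[annotator] += 1
                agree[annotator] += int(label == adjudicated[item_id])
        scores = {a: agree[a] / total[a] for a in total}
        # Highest agreement first; the spread across annotators is the point.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)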

With respect to the availability of data for all categories, if you have
80% of the data in category x, 10% in category y, 5% in category z, and 5%
boundary cases, a random sample for training will serve you poorly, and the
model will be very difficult to scale and highly dubious on accuracy. In
general, it is best to define classification problems in such a way that
you can get more evenly balanced training sets. This is another reason I
like binary code schemes with lower levels of complexity in the annotation.
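
As a simple illustration of the alternative to random sampling, you can cap
every category at the size of the rarest one before training. This is only
a sketch with an assumed "label" column, and downsampling is just one of
several ways to rebalance:

    import pandas as pd

    def balanced_sample(labeled: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
        """Downsample every category to the size of the rarest one."""
        n = labeled["label"].value_counts().min()
        return (labeled.groupby("label", group_keys=False)
                       .apply(lambda g: g.sample(n=n, random_state=seed))
                       .reset_index(drop=True))

At 80/10/5/5, a random sample just reproduces the imbalance; the capped
sample does not, at the cost of labeling more of the rare categories up
front.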

This all gets at the core question: should you build your own model or
apply existing training sets? There is no right answer here. In fact, the
optimal approach is probably a hybrid: use relevant training data that is
out in the ether (if it exists and you trust it) to jump-start the process,
but also do your own annotation to see what unexpected difficulties may be
encountered. I rarely code for more than 5 minutes before having to jot
down notes about something I did not foresee in the data. Then you adjust,
experiment, test, validate, and move on informed by every interaction with
the data.

Overall, my biggest caution is to avoid the pitfall of shortcuts,
dashboards, and visualizations that promise to eliminate the need to do the
harder work. There probably is not a perfect training set out there for
you. Only your work over time can determine what is relevant, accurate, and
reportable as a finding. Plato argued categories are hard; he was right. As
researchers, we need to embody this core idea in how we think about
categorization. Just as all coders are not created equal, not all code
schemes are equal either. Some approaches are better, depending on the
desired outcome and underlying theoretical and applied assumptions.

Over the last 26 days, I have labeled 75,000 Twitter user descriptions for
one very important binary model. There are still boundary cases that
produce vexing results in human and machine classification, but the model
is probably the most powerful and accurate I have ever built. For me, it is
the fullest expression of all these ideas and a roadmap of how I will do
this sort of work going forward.

~Stu










On Wed, Apr 29, 2020 at 5:08 AM Sina Furkan Özdemir <sina.ozdemir at ntnu.no>
wrote:

> Dear all,
>
> I have been following some 800 Twitter accounts for my Ph.D. dissertation
> over the last four months. I have ended up with 400.000 tweets that I need
> to categorize by four mutually exclusive categories.
>
> I looked up some previous works with similar tasks, and it seems that the
> best way is to use a combination of word embeddings and recurrent neural
> networks with LSTM structure.
>
> The problem I am having right now is that I couldn't find training data
> for the classification. Can anyone recommend me some literature on sampling
> strategies for short-text classification tasks?
>
> Best,
> Sina Özdemir
> Ph.D. Candidate
> NTNU, Trondheim
> M.A Comparative and International Studies
> ETH Zurich & University of Zurich, Switzerland
> B.A. Political Science and International Relations
> Middle East Technical University, Turkey
>
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at:
> http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>
> Join the Association of Internet Researchers:
> http://www.aoir.org/
>


-- 
Dr. Stuart W. Shulman
Founder and CEO, Texifter


