[Air-L] Sampling strategies for classification tasks

ali hürriyetoglu ali.hurriyetoglu at gmail.com
Wed Apr 29 04:32:19 PDT 2020


Dear Sina,

I can suggest my work on Relevancer [1], which helps make sense of tweet
collections and facilitates working with them. The whole story is in my
dissertation [2].

Let me know if you think I can help you any further.

Best,

Ali

[1] Hürriyetoğlu, A., Oostdijk, N., Başar, M. E., & van den Bosch, A.
(2017, June). Supporting experts to handle tweet collections about
significant events. In *International Conference on Applications of Natural
Language to Information Systems* (pp. 138-141). Springer, Cham.
[2] Hürriyetoğlu, A. (2019). *Extracting Actionable Information from
Microtexts* (Doctoral dissertation, [Sl: sn]). URL:
https://repository.ubn.ru.nl/handle/2066/204517

On Wed, 29 Apr 2020 at 14:19, Shulman, Stu <stu at texifter.com> wrote:

> Sina,
>
> You face many key choices, not all of them algorithmic:
>
> - how many tasks?
> - which are the easy ones versus which are hard?
> - what is the best order of tasks?
> - why 4 categories?
> - will you make categories mutually exclusive or not?
> - how will you resolve boundary cases?
> - is there enough unlabeled and labeled data to build a balanced model?
> - can you create original, corpus-based labels, yourself or in a
> small group?
> - do labels created by others, at another time, for another reason, using a
> different corpus, work?
> - how will you validate that the labels, or the model they produce, are
> accurate?
> - how do you know a good annotator/evaluator from a poor one?
> - how does the task-specific aptitude of the annotators impact the
> training data and the model?
>
> Let me explain my thinking on these ideas, which reflects more than 30
> years of labeling: it started with a pen-and-paper undergraduate senior
> thesis in 1988, included 10 years of using and teaching NUD*IST, NVivo,
> and Atlas.ti, and continued through my own software development journey to
> roll everything I know into two labeling platforms.
>
> The number of tasks is most important: one task for all categories or
> one/multiple tasks for each category? I always favor the latter. To the
> extent you decompose problems into separate tasks, the chance you will get
> them done consistently and accurately improves. By breaking the tasks down,
> you very quickly (sometimes in minutes) learn what is easy, what is
> difficult, and start to think about the best order of tasks, which with
> Twitter is often as follows: collect, deduplicate, sample seeds and
> singles, scope for relevance issues, and start your first binary
> classification, which for me is always relevance. For example, one common
> method sequence is to build a relevance classifier first, then a main topic
> classifier within the relevant data, then a sentiment classifier within the
> relevant, on-topic subset. The end result is much better than going from
> raw data directly to four categories with a pre-wrapped model. On Twitter,
> duplicates are RTs, and you really never want to label the same item over
> and over, which is why deduplication is a prerequisite to get going.
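>
> To make that sequence concrete, here is a rough sketch of the
> deduplicate-then-relevance step, assuming Python with pandas and
> scikit-learn; the file, the column names ("text", "is_relevant"), and the
> model choice are placeholders rather than a prescription:
>
>   # Rough sketch: deduplicate RTs, then fit a binary relevance classifier.
>   # Column names and the logistic regression model are illustrative only.
>   import pandas as pd
>   from sklearn.feature_extraction.text import TfidfVectorizer
>   from sklearn.linear_model import LogisticRegression
>   from sklearn.pipeline import make_pipeline
>
>   tweets = pd.read_csv("tweets.csv")
>
>   # Drop RTs/duplicates so no item gets labeled over and over.
>   tweets["normalized"] = (tweets["text"]
>                           .str.replace(r"^RT @\w+: ", "", regex=True)
>                           .str.lower()
>                           .str.strip())
>   deduped = tweets.drop_duplicates(subset="normalized")
>
>   # First binary task: relevant vs. not relevant, on a labeled seed set.
>   labeled = deduped.dropna(subset=["is_relevant"])
>   relevance_clf = make_pipeline(TfidfVectorizer(min_df=2),
>                                 LogisticRegression(max_iter=1000))
>   relevance_clf.fit(labeled["normalized"], labeled["is_relevant"])
>
>   # Later tasks (topic, sentiment) run only on the relevant subset.
>   deduped["pred_relevant"] = relevance_clf.predict(deduped["normalized"])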
>
> Exclusivity in categories is among the most important decisions. I always
> try to build mutually exclusive categories, in layers as needed, to keep
> the training data as focused as possible on the problem at hand. Even
> adding a third category and making the codes non-mutually exclusive will
> complicate the human labeling, measurement, and the signal produced in a
> machine-learning model. Exclusive categories produce better results in
> nearly all the machine learning we do. If I care about three topics, I
> build three classifiers: A-Not A, B-Not B, and C-Not C. Each is a separate
> task, easier for the humans and the machines. You can mash up the results.
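>
> As a rough illustration of the mash-up idea, assuming scikit-learn (the
> texts and labels below are made-up placeholders):
>
>   # Sketch: one independent binary classifier per topic (A-Not A, B-Not B,
>   # C-Not C), then combine their predictions. Data and labels are made up.
>   from sklearn.feature_extraction.text import TfidfVectorizer
>   from sklearn.linear_model import LogisticRegression
>   from sklearn.pipeline import make_pipeline
>
>   texts = ["tweet one ...", "tweet two ...", "tweet three ..."]
>   labels = {
>       "A": [1, 0, 0],   # 1 = A, 0 = Not A
>       "B": [0, 1, 0],   # 1 = B, 0 = Not B
>       "C": [0, 0, 1],   # 1 = C, 0 = Not C
>   }
>
>   classifiers = {}
>   for topic, y in labels.items():
>       clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
>       clf.fit(texts, y)               # each task is trained on its own
>       classifiers[topic] = clf
>
>   # Mash up the separate predictions for a new item.
>   new_text = ["another tweet ..."]
>   combined = {topic: int(clf.predict(new_text)[0])
>               for topic, clf in classifiers.items()}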
>
> Boundary cases are the hard part in annotation and machine learning. Some
> are irresolvable. Others you can plan around or learn your way through if
> you label in short stints with small groups who all write reflection memos
> in a shared Google form after each 5-15 minute labeling session. The best
> way to limit the impact of irresolvable boundary cases is to decompose the
> problems as noted above. In the end, someone has to decide who is right
> when multiple annotators disagree on a boundary case. We call that process
> CoderRank, because it reveals over time that people are never equally able
> to understand and execute a classification task. The more people you add,
> the clearer this fundamental fact becomes. Coders are not equal; some are
> terrible and a small number are truly legendary. You need to know where you
> and your annotators sit on the spectrum.
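>
> CoderRank itself is our own process, but as a minimal illustration of the
> underlying idea, assuming scikit-learn, you can score each annotator
> against adjudicated decisions on the same items, for example with Cohen's
> kappa (the labels below are made up):
>
>   # Minimal illustration (not the actual CoderRank implementation): score
>   # each annotator against adjudicated "gold" decisions with Cohen's kappa.
>   from sklearn.metrics import cohen_kappa_score
>
>   gold = ["A", "A", "B", "B", "A", "B", "A", "B"]   # adjudicated decisions
>   annotators = {
>       "annotator_1": ["A", "A", "B", "B", "A", "B", "A", "B"],  # strong
>       "annotator_2": ["A", "B", "B", "A", "A", "B", "B", "B"],  # middling
>       "annotator_3": ["B", "B", "A", "A", "B", "A", "B", "A"],  # poor
>   }
>
>   for name, labels in annotators.items():
>       print(name, round(cohen_kappa_score(gold, labels), 2))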
>
> With respect to the availability of data for all categories, if 80% of the
> data is category x, 10% category y, 5% category z, and 5% boundary cases,
> then a random sample for training will serve you poorly, and the resulting
> model will be very difficult to scale and highly dubious on accuracy. In
> general, it is best to define classification problems in such a way that
> you can get more evenly balanced training sets. This is another reason I
> like binary code schemes with lower levels of complexity in the annotation.
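>
> A rough sketch of the difference, assuming pandas and reusing the 80/10/5/5
> proportions above (the sizes and column name are illustrative):
>
>   # Sketch: a simple random sample reproduces the 80/10/5/5 imbalance,
>   # while sampling an equal number per category yields a balanced set.
>   import pandas as pd
>
>   data = pd.DataFrame({"category": ["x"] * 8000 + ["y"] * 1000
>                                     + ["z"] * 500 + ["boundary"] * 500})
>
>   random_sample = data.sample(n=1000, random_state=0)
>   print(random_sample["category"].value_counts())    # still about 80/10/5/5
>
>   balanced_sample = data.groupby("category").sample(n=250, random_state=0)
>   print(balanced_sample["category"].value_counts())  # 250 of each category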
>
> This all gets at the core question: should you build your own model or
> apply existing training sets? There is no right answer here. In fact, the
> optimal approach is probably a hybrid: use relevant training data that is
> out in the ether (if it exists and you trust it) to jump-start the process,
> but also do your own annotation to see what unexpected difficulties may be
> encountered. I rarely code for more than 5 minutes before having to jot
> down notes about something I did not foresee in the data. Then you adjust,
> experiment, test, validate, and move on informed by every interaction with
> the data.
>
> Overall, my biggest caution is to avoid the pitfall of
> shortcuts, dashboards, and visualizations that eliminate the need to do the
> harder work. There probably is not a perfect training set out there for
> you. Only your work over time can determine what is relevant, accurate, and
> reportable as a finding. Plato argued categories are hard; he was right. As
> researchers, we need to embody this core idea in how we think about
> categorization. Just as all coders are not created equal, not all code
> schemes are equal either. Some approaches are better, depending on the
> desired outcome and underlying theoretical and applied assumptions.
>
> Over the last 26 days, I have labeled 75,000 Twitter user descriptions in
> one very important binary model. There are still boundary cases that
> produce vexing results in human and machine classification, but the model
> is probably the most powerful and accurate I have ever built. For me, it is
> the fullest expression of all these ideas and a roadmap of how I will do
> this sort of work going forward.
>
> ~Stu
>
> On Wed, Apr 29, 2020 at 5:08 AM Sina Furkan Özdemir <sina.ozdemir at ntnu.no>
> wrote:
>
> > Dear all,
> >
> > I have been following some 800 Twitter accounts for my Ph.D. dissertation
> > over the last four months. I have ended up with 400,000 tweets that I
> > need to categorize into four mutually exclusive categories.
> >
> > I looked up some previous work on similar tasks, and it seems that the
> > best approach is to use a combination of word embeddings and a recurrent
> > neural network with an LSTM architecture.
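> >
> > (For context, the kind of model I have in mind looks roughly like the
> > sketch below in Keras; the vocabulary size, dimensions, and four-class
> > output are placeholders.)
> >
> >   # Rough sketch: word embeddings feeding an LSTM, with a softmax over
> >   # four mutually exclusive categories. All sizes are placeholders.
> >   import tensorflow as tf
> >
> >   model = tf.keras.Sequential([
> >       tf.keras.layers.Embedding(input_dim=20000, output_dim=100),
> >       tf.keras.layers.LSTM(64),
> >       tf.keras.layers.Dense(4, activation="softmax"),
> >   ])
> >   model.compile(optimizer="adam",
> >                 loss="sparse_categorical_crossentropy",
> >                 metrics=["accuracy"])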
> >
> > The problem I am having right now is that I couldn't find training data
> > for the classification. Can anyone recommend some literature on sampling
> > strategies for short-text classification tasks?
> >
> > Best,
> > Sina Özdemir
> > Ph.D. Candidate
> > NTNU, Trondheim
> > M.A. Comparative and International Studies
> > ETH Zurich & University of Zurich, Switzerland
> > B.A. Political Science and International Relations
> > Middle East Technical University, Turkey
> >
>
>
> --
> Dr. Stuart W. Shulman
> Founder and CEO, Texifter
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at:
> http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
>
> Join the Association of Internet Researchers:
> http://www.aoir.org/


