[Air-L] "Big Data" Tools

VJ Um Amel laila at vjumamel.com
Sun Apr 19 14:01:35 PDT 2015


Thanks, Bobo. I hope to make this whole system replicable for a public audience so people can build their own archives of collections that matter to them. This is what I envision for R-Shief’s future.

Here is a simplified answer to your questions:

First of all, languages are encoded differently. There are different systems for encoding the various character sets of the world’s languages. For example, ASCII covers English and contains 128 code points; UTF-8 and UTF-16 are Unicode encodings that cover Arabic, Hebrew, Russian, and other character sets; GB 18030 is used for Chinese and EUC for Japanese. In addition to that basic difference, Arabic, for example, is read right to left, which changes things up. All of that requires different programming. And any semantic processing raises an entire set of other issues to consider.
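
As a rough illustration (a minimal Python 3 sketch; the words are just examples), the same text takes a different byte form under each encoding, and ASCII simply cannot represent Arabic at all:

# The same logical text occupies different byte forms under different encodings.
english = "peace"
arabic = "سلام"  # the Arabic word "salam", four letters

print(len(english.encode("ascii")))     # 5 bytes: ASCII covers English
print(len(arabic.encode("utf-8")))      # 8 bytes: 2 bytes per Arabic letter
print(len(arabic.encode("utf-16-le")))  # 8 bytes: 2 bytes per code unit
# arabic.encode("ascii") raises UnicodeEncodeError: ASCII has no Arabic
# characters. Note also that the bytes are stored in logical order; the
# right-to-left display of Arabic is a separate rendering step (the Unicode
# bidirectional algorithm), which is part of what "changes things up".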

I would recommend the work conducted by a small group in South Africa a few years back: "Effecting Change through Localisation: Localisation Guide for Free and Open Source Software." This guide was intended for the Arabic and African Free and Open Source Software (FOSS) communities. I believe they published it in several languages; I have the Arabic and English versions.

Translation is another beast entirely. I am not familiar with Duolingo. Google seems to have done some of the best work in this field, but automated translation still produces many errors.

Cheers,
Laila

On Apr 19, 2015, at 1:37 PM, Bobo <the.bobo at gmail.com> wrote:

> R-Shief looks very cool, not just technologically but also in the commitment to open and free access with a non-profit funding model. That must be so much work! To rip off Gandhi, perhaps digital humanists should follow suit more generally and "code the change [we] want to see?"
> 
> A couple questions:
> 
> 1) Why are non-Western languages harder to scrape from the Twitter API? Does it just not serve double-byte characters (e.g., Chinese characters) well?
> 
> 2) Curiosity question: localization is hard to do. Has any work been done on automating translation of archival material through something like Duolingo?
> 
> Best,
> Bobo
> 
> On Sun, Apr 19, 2015 at 4:15 PM, VJ Um Amel <laila at vjumamel.com> wrote:
> Thanks for bringing up this issue. I have mentioned this several times in my research regarding the Arab uprisings. When eighty to ninety-nine percent of all social media content on social movements in the Middle East is in Arabic, it is clear that we must conduct our research in that language. However, as you mentioned, there is a lack of tools, access, and overall research.
> 
> My doctoral work included building the R-Shief media system (http://r-shief.org), which has archived and analyzed 18 billion posts over five years in over seventy languages, with a specific emphasis on Arabic (http://kal3a.r-shief.org/search). We started collecting tweets by hashtags in Arabic as soon as Twitter made that functional in March 2012 (http://r-shief.org/historical-archive/). We have also built an open source Arabic Text Analyzer (http://r-shief.org/tools/arabic-entity-extraction/) and conducted semantic and sentiment analysis in Arabic. Our work and tools have only scratched the surface (http://r-shief.org/tools/). There is a lot more to be done in open source software localization in non-Western, non-English languages.
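> 
> As a rough sketch of what the hashtag collection step looks like (this is illustrative, not our production pipeline; it assumes the tweepy library and placeholder API credentials):
> 
> import tweepy
> 
> # Placeholder credentials -- substitute your own Twitter API keys.
> auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
> auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
> 
> class HashtagListener(tweepy.StreamListener):
>     def on_status(self, status):
>         # Archive each incoming tweet; printing stands in for storage here.
>         print(status.text)
> 
> # Filter the streaming API on an Arabic hashtag, e.g. #مصر ("Egypt").
> stream = tweepy.Stream(auth=auth, listener=HashtagListener())
> stream.filter(track=["#مصر"])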
> 
> 
> ---
> Laila Shereen Sakr </VJ Um Amel>
> PhD in Media Arts and Practice
> USC School of Cinematic Arts
> http://vjumamel.com
> http://r-shief.org
> +1-202-462-6242
> 
> 
> 
> On Apr 15, 2015, at 2:06 PM, kalev leetaru <kalev.leetaru5 at gmail.com> wrote:
> 
> > One of the biggest issues that I see on a daily basis in the policy world
> > is that the vast majority of "big data" work (and even "little data" work)
> > is based primarily or exclusively on English-language and/or Western data
> > sources and attempts to use such sources to make arguments about current
> > events, narratives, and emotions in the non-English non-Western world.
> > There are simply far more tools available for performing analysis of
> > English material than there are for Swahili, for example, or even Arabic,
> > and bilingualism is not as prevalent in many areas of study, so I end up
> > seeing an incredible number of studies based on English-language content
> > about non-English speaking areas of the world.  Similarly, Twitter has
> > become the go-to dataset for social media studies even as Facebook, Weibo,
> > VK, Viber, WhatsApp, etc., offer better access to certain communities or
> > modalities, but don't offer the same easy firehose API and tool ecosystem,
> > so researchers go with the easier path rather than focusing on which
> > platform might offer the best access to the community or phenomena they
> > are trying to measure.
> >
> > This is something that needs a great deal more attention in the
> > quantitative and "big data" spaces.  Two of my Foreign Policy columns on
> > this topic may be of interest regarding just how much our understanding of
> > the world is skewed by this fixation on English-language, Western sources.
> > My most recent one, out this afternoon, explores how our understanding of
> > global terrorism trends rests almost exclusively on English-language news
> > coverage and how that coverage has shaped the trends we perceive:
> >
> > http://foreignpolicy.com/2015/04/15/why-we-cant-just-read-english-newspapers-to-understand-terrorism-big-data/
> >
> > http://www.foreignpolicy.com/articles/2014/09/26/why_big_data_missed_the_early_warning_signs_of_ebola
> >
> >
> > ~K
> >
> >
> >
> >> From: Air-L [mailto:air-l-bounces at listserv.aoir.org] On Behalf Of Matthew Weber
> >> Sent: Thursday, April 09, 2015 11:08 PM
> >> To: air-l at listserv.aoir.org
> >> Subject: [Air-L] "Big Data" Tools
> >>
> >> AIR’ers:
> >>
> >> I’m working on compiling a rough list of tools and training modules that
> >> are useful for working with large-scale datasets (“Big Data”).
> >> Essentially, I’m trying to build *something* that I can point newbies and
> >> graduate students to when they say “I want to do Big Data”. I’ve got a
> >> rough list of Coursera / edX / blog modules, but would welcome suggestions.
> >> I’m happy to share back the results.
> >>
> >> (I did try to check the AIR archive, but was unable to access it.)
> >>
> >> Thanks!
> >> Matt
> >>
> >>
> >>
> >>
> >> Matthew S. Weber
> >> Assistant Professor
> >> School of Communication and Information
> >> Rutgers University
> >>
> >> (ph): 848-932-8718
> >>
> >>
> >>
> >>
> >>
> >>
> 
> _______________________________________________
> The Air-L at listserv.aoir.org mailing list
> is provided by the Association of Internet Researchers http://aoir.org
> Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org
> 
> Join the Association of Internet Researchers:
> http://www.aoir.org/
> 


