[Air-L] New CDT report + event tomorrow May 24, 2023 (10 am ET) - Can Large Language Models Analyze Non-English Content?

Dhanaraj Thakur dthakur at cdt.org
Tue May 23 06:14:44 PDT 2023


Hi everyone,

We are excited to announce the publication of our new CDT research 
report, “Lost in Translation: Large Language Models in Non-English 
Content Analysis” 
<https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/>.



The report explains the capabilities of a new AI technology called 
“multilingual language models” that technology companies claim can 
understand content in over 100 languages by extrapolating linguistic 
patterns from high-resource languages. We further describe how these 
models work, and argue that they have significant limitations 
<https://cdt.org/press/cdt-finds-key-shortcomings-when-large-language-models-analyze-non-english-languages/>, 
particularly in “low-resource languages” — languages for which AI 
developers have little text data available to train AI models, 
regardless of the number of speakers around the world.


Companies, researchers, civil society advocates, and policymakers should 
be aware of these limitations, as they can create real barriers to 
information access and equitable online participation for individuals. 
We also offer guidance on how to help close the gap between companies’ 
ability to moderate content in English versus the world’s other 7,000 
languages.


The full report is available on CDT’s website, along with executive 
summaries 
<https://cdt.org/insights/lost-in-translation-large-language-models-in-non-english-content-analysis/> 
in Spanish, French, and Arabic 
<https://cdt.org/press/cdt-finds-key-shortcomings-when-large-language-models-analyze-non-english-languages/>. 
Tomorrow, we’ll discuss the paper at an event called “Mind the Gap” 
<https://cdt.org/event/mind-the-gap-can-large-language-models-analyze-non-english-content/> (see 
below for more details). We hope you can join us!


Finally, we have an article out in WIRED 
<https://www.wired.com/story/content-moderation-language-artificial-intelligence/> about 
how social media companies specifically use multilingual language models 
to moderate content in languages other than English.


Feel free to share, and let us know if you have any questions or feedback.

take care,

Dhanaraj

On 5/10/23 4:45 PM, Dhanaraj Thakur wrote:
>
> Hi everyone,
>
> Please see details below about an online event CDT is hosting on May 
> 24 at 10am ET. This will follow the upcoming launch of our research 
> report "Lost in Translation: Large Language Models in Non-English 
> Content Analysis." In the meantime please RSVP for our event here 
> <https://www.eventbrite.com/e/mind-the-gap-can-large-language-models-analyze-non-english-content-tickets-631677633807>. 
>
>
> thanks,
>
> Dhanaraj
>
>
> *Mind the Gap: Can Large Language Models Analyze Non-English Content?*
>
> *Time:* 10:00 AM EDT
>
> *Date:* May 24, 2023
>
> From search engines to social media to hiring algorithms, automated 
> systems increasingly shape people’s online experiences worldwide. 
> Despite internet users speaking thousands of languages, most of these 
> systems are primarily trained using English-language data. Computer 
> scientists claim that they have found a solution to this linguistic 
> gap in a new technology called “multilingual language models.” 
> Multilingual language models work similarly to the language models 
> that power new generative systems like ChatGPT, but instead of being 
> trained on millions of examples of text in mostly one language, they 
> pull text from dozens or hundreds of languages and learn connections 
> between them.
>
> But do these multilingual language models work as well as companies 
> say they do? A new technical primer 
> <https://cdt.org/insights/languages-left-behind-automated-content-analysis-in-non-english-languages/> by 
> CDT shows that these systems may have key shortcomings which only 
> compound when used to analyze non-English languages.
>
> This panel will convene NLP researchers building systems and 
> digitizing languages spoken by millions of people in India and South 
> Africa, content policy experts evaluating the impact these systems 
> have on users’ rights, and CDT’s research and policy team members for 
> a deep dive into how these multilingual language models work, what 
> their capabilities and limitations are, how they can be improved, and 
> what’s at stake when these systems fall short.
>
> Speakers:
>
>   * Aliya Bhatia <https://cdt.org/staff/aliya-bhatia/>, Center for
>     Democracy & Technology
>   * Gabriel Nicholas <https://cdt.org/staff/gabriel-nicholas/>, Center
>     for Democracy & Technology
>   * Dr Monojit Choudhury
>     <https://www.microsoft.com/en-us/research/people/monojitc/>,
>     Turing Institute
>   * Dr Vukosi Marivate
>     <https://africa.harvard.edu/people/vukosi-marivate>, Masakhane
>   * Jacqueline Rowe
>     <https://www.gp-digital.org/team/jacqueline-rowe/>, Global
>     Partners Digital
>
> *RSVP here* 
> <https://www.eventbrite.com/e/mind-the-gap-can-large-language-models-analyze-non-english-content-tickets-631677633807>
>
>
>
> -- 
>
> *Dhanaraj Thakur* (he/him) | Research Director
> Center for Democracy & Technology |*cdt.org <https://cdt.org/>*
> *E:* dthakur at cdt.org | *P:* +1 202 407 8849
>
-- 

*Dhanaraj Thakur* (he/him) | Research Director
Center for Democracy & Technology |*cdt.org <https://cdt.org/>*
*E:* dthakur at cdt.org | *P:* +1 202 407 8849


