[Air-L] Chinese language social media data mining tools

Giuseppe A. Veltri ga.veltri at gmail.com
Tue May 9 10:45:09 PDT 2017


While there are a few options, mainly R packages, for getting data
from Chinese social media such as Weibo (its API is similar to
Twitter's, as you would expect), if your intention is to do text mining
on the content, the latter is a great pain in the neck. I have been
battling with this for a while, but in short:
- You need a tokenizer (word segmenter) that works with Mandarin,
because written Chinese does not put spaces between words. The best was
ICTCLAS, developed by the Chinese Academy of Sciences. There are still a
few open source versions circulating online, but the most up-to-date one
has become proprietary.
- Once you have pre-processed the corpus, you can then use NLP packages
for topic modelling, but there are still some segmentation issues to
take care of. You can't really use out-of-the-box solutions such as
WordStat or the like because they just produce nonsense.
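
To make the segmentation point concrete, here is a minimal Python sketch.
It is not ICTCLAS or any package named above: it uses a toy technique
(forward maximum matching against a dictionary), and the lexicon and the
sample sentence are invented for illustration. It only shows why
splitting on whitespace gives nonsense and what a segmenter has to do
instead.

```python
# Toy forward-maximum-matching segmenter (illustrative only).
# LEXICON is a tiny made-up sample; a real lexicon is far larger.
LEXICON = {"我", "喜欢", "社交", "媒体", "社交媒体", "数据", "挖掘", "数据挖掘"}
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Split text into words by greedily taking the longest lexicon match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


sentence = "我喜欢社交媒体数据挖掘"  # hypothetical example text
print(sentence.split())   # whitespace split: the whole sentence is one "word"
print(segment(sentence))  # dictionary-based segmentation into words
```

Greedy matching like this gets ambiguous strings wrong and cannot handle
words missing from the lexicon (names, slang, hashtags), which is part of
why the segmentation issues persist even after pre-processing.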

Overall, not a walk in the park for sure.

Good luck,
GV
On Tue, 2017-05-09 at 16:58 +0100, Helen Kennedy wrote:
> Hello clever AOIR folks
> 
> Asking for postgrad students: any recommendations of social media data
> mining tools that work on Chinese social media platforms / with Chinese
> languages?
> 
> Thanks!
> 
> Helen
> 
> 
-- 
----------------
Giuseppe A. Veltri
work email: giuseppe.veltri at unitn.it
Twitter: @gaveltri
ResearchGate



