Topic Modelling on Chinese text

jsnleong · November 10, 2022, 2:41am

Hi!

Just a really broad question here... can drill down further if need be.

What avenues are available for Prodigy to shine in unsupervised topic modelling for Chinese (or other non-English languages)?
If Prodigy/spaCy is not well suited to be used directly, what preprocessing or ML techniques could make the data better suited to Prodigy's use case?
Also, are there any tips/suggestions for topic modelling conversational chats?

Thanks!

koaning · November 11, 2022, 1:46pm

Hi Jason.

I want to be a little bit careful about giving advice on Chinese because it's a language that I myself do not speak. I am aware of some tools/tricks that might help, which I'll share below, but I cannot properly judge how well they might work.

That said, here are some ideas.

Prodigy does not do topic modeling out of the box because it's mainly concerned with adding annotations. That said, a typical flow is that a topic model would be used at the beginning of a project to inspire some labels worth predicting. Then, Prodigy can be used to attach these labels to the texts of interest.
There are embedding models out there that support Chinese. The sentence-transformers project lists a few such models, and LaBSE in particular might suffice. These models can be used together with the BERTopic package to do topic research, or they can be used as features for bulk labelling. If you're interested in the bulk labelling approach, please check our tutorial video on the topic.
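
As a rough sketch of the BERTopic route mentioned above, something like the following might be a starting point. It assumes the `bertopic` and `sentence-transformers` packages are installed and that the LaBSE checkpoint is available under the name `sentence-transformers/LaBSE`; the helper names `clean_docs` and `fit_topics` are made up for this illustration. I have not checked how well the resulting topics work for Chinese.

```python
def clean_docs(texts):
    """Strip whitespace, drop empty strings and exact duplicates, keep order."""
    seen = set()
    docs = []
    for text in texts:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            docs.append(text)
    return docs


def fit_topics(docs, model_name="sentence-transformers/LaBSE"):
    """Fit a BERTopic model on top of a multilingual sentence embedding model.

    Assumption: `model_name` is a checkpoint that sentence-transformers can load.
    """
    # Imported here because both packages are heavy and the embedding
    # model is downloaded the first time it is used.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embedding_model = SentenceTransformer(model_name)
    topic_model = BERTopic(embedding_model=embedding_model)
    # fit_transform returns one topic id per document (-1 means outlier).
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics
```

Calling `fit_topics(clean_docs(my_texts))` and then `topic_model.get_topic_info()` should give a table of candidate topics to inspect. Two caveats: BERTopic generally needs a reasonably large set of documents to cluster, and its default keyword extraction relies on word boundaries, which may not segment Chinese well, so a custom `vectorizer_model` with a Chinese tokenizer is probably worth investigating.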

The main issue with conversational texts is that an utterance on its own lacks context. When a person says "Good evening!" at the beginning of a conversation, this could be interpreted as a "greeting" intent, but it can also mean "goodbye" if it is uttered at the end. So I wouldn't just spend time looking at single utterances, but also at the full conversations, to understand what the user wants the chatbot to understand.
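
One way to keep that context around while annotating is to attach the turn position and the preceding turns to every utterance. The sketch below is only an illustration: it assumes each conversation is a plain list of strings, and the function name and the keys inside `meta` are hypothetical choices. The `text` and `meta` keys themselves follow Prodigy's usual task format.

```python
def utterances_with_context(conversation, window=2):
    """Build one annotation task per utterance, with its position and
    up to `window` preceding turns stored as metadata."""
    tasks = []
    total = len(conversation)
    for i, utterance in enumerate(conversation):
        # Up to `window` turns that came right before this utterance.
        previous = conversation[max(0, i - window):i]
        tasks.append({
            "text": utterance,
            "meta": {
                "turn": i + 1,
                "of": total,
                "previous": " | ".join(previous),
            },
        })
    return tasks


conversation = [
    "Good evening!",
    "Hi, how can I help?",
    "Just browsing.",
    "Good evening!",
]
tasks = utterances_with_context(conversation)
```

Here the first and last tasks have the same text, but the metadata shows that one opens the conversation and the other closes it, which is the distinction that gets lost when utterances are viewed in isolation.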
