Topic Modelling on Chinese text


Just a really broad question here... can drill down further if need be.

  1. What are the avenues available for Prodigy to shine in areas of Unsupervised Topic Modelling in Chinese (or other languages besides English)?

  2. If Prodigy/spaCy is not optimal to be used directly, then what are the possible processing or ML techniques to make the data more suited for Prodigy's use-case?

  3. Also, are there any tips/suggestions to topic model conversational chats?


Hi Jason.

I want to be a little careful about giving advice on Chinese because it's a language that I do not speak myself. I am aware of some tools and tricks that might help, which I'll share below, but I cannot properly judge how well they work for Chinese.

That said, here are some ideas.

  1. Prodigy does not do topic modeling out of the box because it's mainly concerned with adding annotations. That said, a typical flow is that a topic model would be used at the beginning of a project to inspire some labels worth predicting. Then, Prodigy can be used to attach these labels to the texts of interest.
  2. There are embedding models out there that support Chinese. The sentence-transformers project lists a few such models, and LaBSE in particular might suffice. These models can be used together with the BERTopic package to do topic research, or their embeddings can be used as features for bulk labelling. If you're interested in the bulk labelling approach, please check our video on the topic.
  3. The main issue with conversational texts is that an utterance on its own lacks context. When a person says "Good evening!" at the beginning of a conversation, it could be interpreted as a "greeting" intent, while it can also mean "goodbye" if it is uttered at the end. So I wouldn't just spend time looking at single utterances, but also at the full conversations to understand what the user wants the chatbot to understand.
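
To make points 2 and 3 a bit more concrete, here is a rough sketch of how they could be combined. It is only one possible approach, not something Prodigy ships with: `conversation_windows` and `fit_topics` are hypothetical helper names, the window size of 2 is an arbitrary choice, and `sentence-transformers/LaBSE` is assumed to be the model name on the Hugging Face hub. The idea is to group neighbouring turns into one document, so that a short utterance like "Good evening!" gets embedded together with its context, and then pass those documents to BERTopic.

```python
from typing import List


def conversation_windows(turns: List[str], size: int = 3, sep: str = " ") -> List[str]:
    """Join each run of `size` consecutive turns into one document.

    Hypothetical helper: it keeps neighbouring utterances together so that
    a short turn is embedded along with the turns around it.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    if len(turns) <= size:
        # Short conversation: keep it as a single document (or nothing).
        return [sep.join(turns)] if turns else []
    return [sep.join(turns[i:i + size]) for i in range(len(turns) - size + 1)]


def fit_topics(docs: List[str]):
    """Fit BERTopic on top of LaBSE sentence embeddings (assumed setup)."""
    # Imported lazily so the helper above works without these packages.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("sentence-transformers/LaBSE")
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


# The same phrase opens and closes this toy chat, but lands in
# different windows and therefore gets a different context.
chat = ["Good evening!", "Where is my order?", "It ships tomorrow.", "Thanks, good evening!"]
docs = conversation_windows(chat, size=2)

# With a real corpus (a handful of documents is far too few for clustering):
# topic_model, topics = fit_topics(all_docs)
# print(topic_model.get_topic_info())
```

The topic keywords that BERTopic reports come from a bag-of-words step, so for Chinese you will probably need to plug in a word segmenter there; I can't judge which one works best. The resulting topics are meant as inspiration for labels, which you could then annotate in Prodigy.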