Topic Modelling with text classification

I need to find out which topics relate to each sentence in a corpus of about 10,000 sentences, so I decided to use the text classification feature.

I have 9 non-exclusive categories to identify for each sentence, and I started by manually annotating the sentences with the following command:

prodigy textcat.manual news_topics ./prodigyInput.jsonl --label no-poverty-zero-hunger,good-health-well-being-life-quality,quality-education,equality-decent-work-inclusion,sustainable-consumption-production-industry-innovation,climate-land-water-clean-energy,peace-justice,partnerships-for-the-goals,sustainable-development

You can see a typical sentence, in terms of length and complexity, in the screenshot below (sorry for the Italian).

I have already annotated 1,000 sentences and updated the standard Italian model (it_core_news_sm-2.3.0) with the following command:

prodigy textcat.batch-train news_topics it_core_news_sm-2.3.0 --output ./models/

Then I tried to use the new model to infer the topic for a set of test sentences using this piece of code:

import spacy

model = spacy.load("./models")

sentences = [
    "sentence_1",
    "sentence_2",
    ...
]

for sentence in sentences:
    doc = model(sentence)
    for k, v in doc.cats.items():
        if v > 0.60:
            print(sentence, " type: ", k, " score:", v)

Unfortunately, almost all the categories return a score of about 0.9 for every sentence.

So I have some questions:

  1. Is the described procedure correct? Do you have any suggestions for changing the procedure, or another recipe to use (e.g. textcat.teach to refine the trained model)?
  2. How many sentences do you think I need to annotate before I get clear discrimination between the categories?
  3. Is there a Prodigy tool I can use to monitor the training of the model while I change or increase the set of annotated sentences?
  4. I think this could be helpful during the annotation phase, so that I can select the best sentences and make the model better...

Thanks

Hi! 1000 examples is definitely a good start, assuming they're representative. This should typically give you an idea of whether your model is learning something or if there are any problems.

How many examples are you evaluating on? If you're just looking at a few selected sentences, it can be difficult to get meaningful insights, so you typically want to perform a more stable evaluation. For example, use a dedicated evaluation set that doesn't change between runs and compare the scores as you train your model and collect more data.
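For instance, a minimal evaluation sketch could look something like the code below. It assumes you've exported your held-out examples to a JSONL file with "text" and "cats" keys (the file name eval.jsonl and that format are just placeholders), so adjust it to however you store your evaluation data:

    import json
    import spacy

    THRESHOLD = 0.5  # decision threshold per label

    nlp = spacy.load("./models")

    # A fixed evaluation set that never changes between runs. Assumed format:
    # one JSON object per line with "text" and "cats" (a dict of label -> 0 or 1).
    with open("eval.jsonl", encoding="utf8") as f:
        examples = [json.loads(line) for line in f]

    tp = fp = fn = 0
    for eg in examples:
        doc = nlp(eg["text"])
        for label, gold in eg["cats"].items():
            pred = doc.cats.get(label, 0.0) >= THRESHOLD
            if pred and gold:
                tp += 1
            elif pred and not gold:
                fp += 1
            elif not pred and gold:
                fn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    print(f"P: {precision:.2f}  R: {recall:.2f}  F: {f_score:.2f}")

If the model really does predict ~0.9 for everything, precision will collapse here, which is a much clearer signal than eyeballing individual sentences.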

To run quick experiments, you can use the prodigy train command (https://prodi.gy/docs/recipes#train) to train a model from your Prodigy dataset. It's newer and more flexible than the old textcat.batch-train. You can also run data-to-spacy to convert your annotations for use with spacy train.
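For example, something along these lines should work (I'm writing the arguments from memory, so double-check them against prodigy train --help for your version):

    prodigy train textcat news_topics it_core_news_sm-2.3.0 --output ./models --eval-split 0.2

The --eval-split here just holds back 20% of the dataset for evaluation; once you have a dedicated evaluation dataset, you can point --eval-id at it instead.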

From what you describe, it sounds like your model has learned to pretty much always predict a high score, so you probably want to do some data debugging before you move on to workflows with your model in the loop. So I'd suggest:

  • Make sure you hold back enough evaluation examples, train with prodigy train and check the results.
  • If the scores are low, double-check your annotations in the dataset and make sure you have a representative distribution of categories in there. You can also run data-to-spacy and inspect the merged examples.
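One quick way to do that sanity check is to pull the annotations straight out of the Prodigy database and count the labels, roughly like this (textcat.manual with multiple labels uses the choice interface, which stores the selected labels under the "accept" key):

    from collections import Counter
    from prodigy.components.db import connect

    # Connect to the Prodigy database and load the annotated dataset
    db = connect()
    examples = db.get_dataset("news_topics")

    # Count how often each label was selected in accepted examples
    label_counts = Counter()
    n_accepted = 0
    for eg in examples:
        if eg.get("answer") != "accept":
            continue
        n_accepted += 1
        for label in eg.get("accept", []):
            label_counts[label] += 1

    print(f"{n_accepted} accepted examples")
    for label, count in label_counts.most_common():
        print(f"{label:<55} {count}")

If one or two categories dominate, or some barely occur at all, that's often enough to explain uninformative scores.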