I need to find the topic(s) associated with each sentence in a corpus of about 10,000 sentences, so I decided to use the text classification feature.
I have 9 non-exclusive categories to identify for each sentence, and I started by manually annotating the sentences with the following command:
prodigy textcat.manual news_topics ./prodigyInput.jsonl --label no-poverty-zero-hunger,good-health-well-being-life-quality,quality-education,equality-decent-work-inclusion,sustainable-consumption-production-industry-innovation,climate-land-water-clean-energy,peace-justice,partnerships-for-the-goals,sustainable-development
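For reference, prodigyInput.jsonl is a plain JSONL file with one sentence per line; below is a minimal sketch of how such a file could be generated (the sentences are just placeholders, and the "text" field is what textcat.manual reads):

import json

# Placeholder sentences standing in for the real Italian corpus
corpus = [
    "Prima frase di esempio.",
    "Seconda frase di esempio.",
]

# One JSON object per line; textcat.manual reads the "text" field of each task
with open("prodigyInput.jsonl", "w", encoding="utf-8") as f:
    for text in corpus:
        f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")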
The screenshot below shows a typical sentence in terms of length and complexity (sorry for the Italian).
I have already annotated 1,000 sentences and updated the standard Italian model (it_core_news_sm-2.3.0) with the following command:
prodigy textcat.batch-train news_topics it_core_news_sm-2.3.0 --output ./models/
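As a quick sanity check after training (a sketch, assuming the model was written to ./models by the command above), the exported model can be loaded to confirm that the textcat pipe contains all nine labels:

import spacy

# Load the model exported by textcat.batch-train (the --output directory above)
nlp = spacy.load("./models")

# The text classifier should list all nine non-exclusive labels
textcat = nlp.get_pipe("textcat")
print(textcat.labels)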
Then I tried to use the new model to infer the topics for a set of test sentences with this piece of code:
import spacy

# Load the model trained above with textcat.batch-train
model = spacy.load("./models")

sentences = [
    "sentence_1",
    "sentence_2",
    ...
]

for sentence in sentences:
    doc = model(sentence)
    # Keep only the categories the model is reasonably confident about
    for k, v in doc.cats.items():
        if v > 0.60:
            print(sentence, " type: ", k, " score:", v)
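To see the full distribution rather than only the scores above the threshold, the same loop can also print every category sorted by score (just a small variation reusing model and sentences from above):

# Print all nine category scores per sentence, highest first,
# to see how much the labels actually differ from each other
for sentence in sentences:
    doc = model(sentence)
    ranked = sorted(doc.cats.items(), key=lambda item: item[1], reverse=True)
    print(sentence)
    for label, score in ranked:
        print("  {}: {:.3f}".format(label, score))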
Unfortunately, almost all the categories return a score of about 0.9 for every sentence.
So I have some questions:
- Is the described procedure correct? Do you have any suggestions for changing the procedure, or another recipe to use (e.g. textcat.teach to refine the trained model)?
- How many sentences do you think I need to annotate before getting clear discrimination between the categories?
- Is there a Prodigy tool I can use to monitor the training of the model as I change or increase the set of annotated sentences?
- I think this could be helpful during the annotation phase, so that I can select the best sentences to improve the model...
Thanks