I am having a problem with the textcat.teach annotations. I am trying to classify different sports and have used --label FOOTBALL, CRICKET, ATHLETICS, TENNIS, but only one of them at a time is available to me to classify. I know that you recommend only using one label at a time, but I wanted to try multiple and classify accordingly. Is there a flag I have not set?
The textcat.teach recipe will train the text classifier with a model in the loop, based on binary decisions. So just like in ner.teach, Prodigy will get suggestions from the model and ask you about one annotation at a time. The --label argument is used to select which labels Prodigy asks you about. (For example, if your model also had the category BASEBALL available and you didn't list it, Prodigy wouldn't show you suggestions for that.)
There's currently no textcat.manual recipe or manual interface for text classification. Instead, we usually recommend making several passes over your data and doing one label at a time. One reason for that is that annotation is often much easier if your brain gets to focus on one concept at a time. In your case, the categories are more closely related than, say, SPORTS and POLITICS. But still, focusing on "football or not" is much easier than having to keep track of all possible options.
The other reason is that you’ll be able to create much more meaningful training data for the model. The categories of spaCy’s text classifier are not mutually exclusive by default – so if you only annotate each example once and apply the “correct” label, your model won’t learn much about examples that are not about football. By saying yes and no to each label, you’ll be able to collect both positive and negative examples for all categories. (Let’s say you have 1000 texts and your 4 labels. Binary decisions are quick and if you get into a good flow, you can easily get to 1 second per annotation decision. So you can collect 4000 annotations in ~1 hour and you’ll have training data for each label on each text. If you do all labels at once, you can easily spend the same amount of time on the task and only end up with 1000 examples and no negatives.)
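To make the point about positive and negative examples concrete, here is a minimal sketch of how binary decisions could be grouped into one set of labels per text. It assumes each annotation is a dict with "text", "label" and "answer" keys; the helper name collect_cats and the sample texts are hypothetical, made up for illustration.

```python
from collections import defaultdict


def collect_cats(annotations):
    """Group binary accept/reject decisions into one {label: bool} dict per text.

    Assumes each annotation has "text", "label" and "answer" keys
    (an assumption for this sketch, not a guaranteed export format).
    """
    cats_by_text = defaultdict(dict)
    for eg in annotations:
        if eg["answer"] not in ("accept", "reject"):
            continue  # skip ignored or unanswered examples
        cats_by_text[eg["text"]][eg["label"]] = eg["answer"] == "accept"
    return dict(cats_by_text)


annotations = [
    {"text": "Late goal wins the derby", "label": "FOOTBALL", "answer": "accept"},
    {"text": "Late goal wins the derby", "label": "CRICKET", "answer": "reject"},
    {"text": "Late goal wins the derby", "label": "TENNIS", "answer": "reject"},
    {"text": "Rain delays the final set", "label": "TENNIS", "answer": "accept"},
    {"text": "Rain delays the final set", "label": "FOOTBALL", "answer": "ignore"},
]

cats = collect_cats(annotations)
```

Every text ends up with an explicit True or False for each label you answered, which is exactly the negative evidence a single "pick the correct label" pass would not record.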
That said, there are always situations where you just want to collect gold-standard labels by selecting them. In that case, a custom recipe using the choice interface might be a better option. You can check out the live demo here and find an example of the recipe code in the custom recipes workflow. To allow multiple selections, you can set 'choice_style': 'multiple' in the 'config' dictionary returned by your recipe.
Your input data would then look something like this:
{
    "text": "Some text about sports",
    "options": [
        {"id": 0, "text": "FOOTBALL"},
        {"id": 1, "text": "CRICKET"},
        {"id": 2, "text": "ATHLETICS"},
        {"id": 3, "text": "TENNIS"}
    ]
}
You can easily create those programmatically by wrapping the stream in a function that adds the "options" key. See the example for the full recipe code. You'll also find more details on the choice format in your PRODIGY_README.html, available for download with Prodigy.
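As a rough illustration of the wrapping step, here is a minimal sketch of a generator that adds the "options" key to every incoming task. The names add_options and recipe_components are hypothetical, and the "dataset", "stream" and "view_id" keys are my assumption about how the components dictionary of a custom recipe is laid out; only the 'choice_style' setting in 'config' is taken from the description above, so please check the details against your PRODIGY_README.html.

```python
LABELS = ["FOOTBALL", "CRICKET", "ATHLETICS", "TENNIS"]


def add_options(stream, labels=LABELS):
    """Yield each task with an "options" list attached, one entry per label."""
    for task in stream:
        task = dict(task)  # copy so the incoming task is not mutated
        task["options"] = [
            {"id": i, "text": label} for i, label in enumerate(labels)
        ]
        yield task


def recipe_components(dataset, stream):
    """Hypothetical helper: builds the dict a custom recipe might return.

    The key names other than "config" are assumptions for this sketch.
    """
    return {
        "dataset": dataset,
        "stream": add_options(stream),
        "view_id": "choice",
        "config": {"choice_style": "multiple"},
    }


tasks = list(add_options([{"text": "Some text about sports"}]))
components = recipe_components("sports", [{"text": "Some text about sports"}])
```

In a real recipe, the stream would come from your data loader rather than a hard-coded list, and the function returning the dictionary would be registered as a Prodigy recipe.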