I have a dataset with multiple labels which I have annotated.
When I run textcat.batch-train, should I expect the performance of each label to be the same as when training on a dataset with just that single label?
For example:
dataset A: 2 labels, HOTDOG & NOTHOTDOG
dataset B: 1 label, HOTDOG
dataset C: 1 label, NOTHOTDOG
Would running:
textcat.batch-train model datasetA
have the same performance as training 2 separate models on datasets B & C and combining their outputs?
When textcat.batch-train is given several labels, is each label trained separately, or is there any ‘leak’ between labels?
I'm currently comparing these empirically, but some insight and tips would be great.
Hi! The built-in textcat recipes use spaCy’s text classifier implementation, which currently assumes that the labels are not mutually exclusive. So in theory, an example could be both HOTDOG and NOTHOTDOG. Of course, for a binary classification task like your example, this should be easy to work around by annotating and training only one label, HOTDOG.
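To make the "not mutually exclusive" point concrete, here is a minimal sketch in plain Python. It is not spaCy's actual implementation, and the raw scores are made up for illustration. It only contrasts independent per-label scoring (each label gets its own sigmoid) with mutually exclusive scoring (a softmax where the labels compete):

```python
import math

def independent_scores(logits):
    """Sigmoid per label: each score is computed on its own."""
    return {label: 1.0 / (1.0 + math.exp(-z)) for label, z in logits.items()}

def exclusive_scores(logits):
    """Softmax across labels: scores compete and sum to 1."""
    highest = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - highest) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical raw scores for one text (not produced by any real model)
logits = {"HOTDOG": 2.0, "NOTHOTDOG": 1.0}

independent = independent_scores(logits)
exclusive = exclusive_scores(logits)

print(independent)
print(exclusive)
```

With independent scoring, both labels can score above 0.5 at the same time, which is why a text can end up as both HOTDOG and NOTHOTDOG. With exclusive scoring, raising one label's score necessarily lowers the other's.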
spaCy v2.1 introduced the option to make the labels mutually exclusive – so in the next update of Prodigy, you’ll be able to specify this when you annotate and train a model. Depending on what you want to do, you might also find that a different text classification implementation just works better on your problem. In that case, you can export the data from Prodigy and train your model separately, or plug it in via a custom recipe to annotate with a model in the loop. Just make sure your model implementation is sensitive enough to updates.
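If you go the export route, a rough sketch of the conversion step might look like the following. It assumes you have exported the dataset as JSONL (for example with db-out) and that each record has "text", "label" and "answer" keys, as binary textcat annotations usually do. The helper name to_cats and the sample records are made up for illustration:

```python
import json
from collections import defaultdict

def to_cats(jsonl_lines, labels):
    """Group binary accept/reject annotations into one label dict per text."""
    by_text = defaultdict(dict)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        answer = eg.get("answer")
        # Skip ignored examples and labels we are not interested in
        if answer not in ("accept", "reject") or eg.get("label") not in labels:
            continue
        by_text[eg["text"]][eg["label"]] = answer == "accept"
    return [(text, {"cats": cats}) for text, cats in by_text.items()]

# Hypothetical exported records, one JSON object per line
sample = [
    '{"text": "A sausage in a bun", "label": "HOTDOG", "answer": "accept"}',
    '{"text": "A sausage in a bun", "label": "NOTHOTDOG", "answer": "reject"}',
    '{"text": "A slice of pizza", "label": "HOTDOG", "answer": "reject"}',
    '{"text": "Blurry photo", "label": "HOTDOG", "answer": "ignore"}',
]

train_data = to_cats(sample, labels={"HOTDOG", "NOTHOTDOG"})
print(train_data)
```

The resulting (text, {"cats": {...}}) pairs follow the shape spaCy's text classifier takes for training, but most other libraries can consume the same information with a small adapter.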
Okay great, thanks for clarifying.
I was just going through the docs again. Does adding the --label flag to textcat.batch-train override this behaviour in the current version?
Like so: pgy textcat.batch-train data models --label 'label'