Hi,
I'm a bit confused about the best approach to manually annotating documents for text classification. For NER I used ner.manual to manually annotate examples and then trained with --no-missing to make sure that all other tokens are not considered entities. But I couldn't find a corresponding recipe for textcat. What I did instead was use the mark recipe with --view-id classification. There was no --no-missing flag for textcat.batch-train though. My question is: why is there no textcat.manual and no --no-missing option for textcat.batch-train? I also noticed I could do NER with the mark recipe as well, by choosing --view-id ner or ner_manual. What's the difference between using the mark recipe with either ner or ner_manual vs. the ner.teach and ner.manual recipes?
The mark recipe takes whatever comes in and will render it with a given interface – that's it. It doesn't preprocess the text in any way, doesn't apply suggestions from a model, doesn't update anything in the loop etc. So it's also super agnostic to what you're doing there – all it knows is that you want to show some data in the app.
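To make that concrete, here's a rough sketch of what a mark-style custom recipe could look like: load a stream and hand it straight to an interface, nothing else. The recipe name and arguments here are made up for illustration, not the built-in mark signature.

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "mark-sketch",  # hypothetical recipe name, not the built-in mark
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    view_id=("Annotation interface to use", "option", "v", str),
)
def mark_sketch(dataset, source, view_id="classification"):
    stream = JSONL(source)          # pass the incoming examples through untouched
    return {
        "dataset": dataset,         # where the answers are saved
        "stream": stream,           # examples to render in the app
        "view_id": view_id,         # interface, e.g. "classification" or "ner"
    }
```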
The ner.teach recipe, on the other hand, is specifically designed for named entity recognition with a model in the loop. It expects a spaCy model that predicts named entities, gets the predictions, adds highlighted spans to the incoming examples and updates the model with the answers. You can see a slightly simplified version with explanations in our prodigy-recipes repo:
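Roughly, a teach-style recipe with a model in the loop could look like this. This is a sketch along the lines of the simplified example, with abbreviated names and arguments, not the exact built-in code.

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.ner import EntityRecognizer

@prodigy.recipe(
    "ner-teach-sketch",  # hypothetical name, not the built-in ner.teach
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model with an NER component", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    label=("Comma-separated labels to annotate", "option", "l", str),
)
def ner_teach_sketch(dataset, spacy_model, source, label=None):
    nlp = spacy.load(spacy_model)
    labels = label.split(",") if label else None
    model = EntityRecognizer(nlp, label=labels)   # wraps the NER model for scoring/updating
    stream = JSONL(source)                        # raw examples: {"text": "..."}
    # model(stream) yields (score, example) pairs with suggested spans;
    # prefer_uncertain prioritises the examples the model is least sure about
    stream = prefer_uncertain(model(stream))
    return {
        "dataset": dataset,
        "stream": stream,
        "update": model.update,                   # update the model with the answers
        "view_id": "ner",
    }
```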
The ner.manual recipe doesn't update anything in the loop, but it makes sure that the examples that come in are pre-tokenized and that existing annotated spans are aligned with the tokens. This allows faster highlighting, because the selection can "snap" to the token boundaries.
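For example, a manual-style recipe could use the add_tokens preprocessor to do that pre-tokenization. Again, the recipe name and arguments below are just a sketch, not the built-in ner.manual.

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe(
    "ner-manual-sketch",  # hypothetical name, not the built-in ner.manual
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Model providing the tokenizer", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    label=("Comma-separated labels to annotate", "option", "l", str),
)
def ner_manual_sketch(dataset, spacy_model, source, label=""):
    nlp = spacy.load(spacy_model)        # only the tokenizer is really needed here
    stream = JSONL(source)
    stream = add_tokens(nlp, stream)     # adds a "tokens" key so spans can snap to boundaries
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": label.split(",")},
    }
```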
To answer your initial question:
If you're labelling entities, you usually want to highlight spans within a text. If you're assigning top-level categories to a text, that usually doesn't require a specific mechanism: you don't need to pre-tokenize the text or do anything else specific to the task. How you solve it is also more flexible: you can either stream in each example for a given label, or use the choice interface to select one or more categories at once. None of these require custom, textcat-specific logic, which is why we currently don't have a dedicated recipe for that. But maybe it'd make sense to offer a slightly modified version of mark as textcat.manual, just for consistency.
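For instance, a textcat.manual-style recipe using the choice interface could look roughly like this. The recipe name and arguments are hypothetical; the idea is just to attach the labels as selectable options to each incoming task.

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "textcat-manual-sketch",  # hypothetical name, not a built-in recipe
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    label=("Comma-separated category labels", "option", "l", str),
)
def textcat_manual_sketch(dataset, source, label=""):
    labels = label.split(",")

    def add_options(stream):
        # attach the category labels as selectable options to each task
        for eg in stream:
            eg["options"] = [{"id": lbl, "text": lbl} for lbl in labels]
            yield eg

    return {
        "dataset": dataset,
        "stream": add_options(JSONL(source)),
        "view_id": "choice",
        # allow selecting multiple categories; drop this for single-choice
        "config": {"choice_style": "multiple"},
    }
```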
The reason there's no --no-missing flag on textcat.batch-train is that in spaCy, categories were assumed to be not mutually exclusive by default. In the latest version of spaCy, you'll be able to customise this behaviour more easily, so we'll also be adding support for that to Prodigy.
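For instance, assuming the spaCy v2.1-style API, that behaviour is exposed via the textcat component's config, roughly like this:

```python
import spacy

nlp = spacy.blank("en")
# "exclusive_classes": True makes the categories mutually exclusive,
# so the predicted probabilities sum to 1; False scores each label independently
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
nlp.add_pipe(textcat)
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
```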
Thanks a lot for the answer! So if I understand it correctly, ner.manual is a variation of mark with --view-id ner_manual that simply makes the task of manual annotation easier, since it helps with highlighting the correct token boundaries?
But maybe it'd make sense to offer a slightly modified version of mark as textcat.manual, just for consistency.
I guess that would help new people like me find the correct recipe, because I just looked at the textcat recipes and was a bit confused when I didn't find it there.
The reason there's no --no-missing flag on textcat.batch-train is that in spaCy, categories were assumed to be not mutually exclusive by default. In the latest version of spaCy, you'll be able to customise this behaviour more easily, so we'll also be adding support for that to Prodigy.
Oh, that makes sense now. I was actually wondering why the model returned a probability for each category and why they didn't sum up to 100%. But I see that it's more flexible that way. I'm looking forward to testing the newest version with Prodigy then.
Yes, you can see what ner.manual does under the hood here:
It might also help to think of the mark recipe as kind of the most basic recipe: data comes in and is rendered with an interface. That's the minimum you need for any given recipe. More task-specific recipes can also implement other things: for example, data transformations, a model or other process that adds suggestions to the data, an update callback that's executed when new answers are received, and so on.
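As a rough sketch, the components a recipe returns could look like this: the basic stream and view_id, plus optional hooks like update and on_exit. The recipe name and the placeholder data here are made up for illustration.

```python
import prodigy

@prodigy.recipe(
    "components-sketch",  # hypothetical name, just to show the returned components
    dataset=("Dataset to save annotations to", "positional", None, str),
)
def components_sketch(dataset):
    # placeholder stream; a real recipe would load data and maybe transform it
    stream = ({"text": text} for text in ["first example", "second example"])

    def update(answers):
        # called with batches of answered tasks, e.g. to update a model in the loop
        print(f"received {len(answers)} answers")

    def on_exit(controller):
        # called once when the Prodigy server stops
        print("annotation session ended")

    return {
        "dataset": dataset,
        "stream": stream,
        "update": update,        # optional
        "on_exit": on_exit,      # optional
        "view_id": "classification",
    }
```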
Ok, thanks, I think I understand it better now