textcat.manual?

Hi,
I'm a bit confused about the best approach to manually annotating documents for text classification. For NER, I used ner.manual to annotate examples and then trained with --no-missing to make sure that all other tokens are not considered entities. But I couldn't find a corresponding recipe for textcat, so what I did was use the mark recipe with --view-id classification. There was no --no-missing option for textcat.batch-train though. My question is: why is there no textcat.manual recipe and no --no-missing option for textcat.batch-train? I also noticed I could do NER with the mark recipe as well, by choosing --view-id ner or ner_manual. What's the difference between using the mark recipe with ner or ner_manual vs. the dedicated ner.teach and ner.manual recipes?

The mark recipe takes whatever comes in and renders it with a given interface – that's it. It doesn't preprocess the text in any way, doesn't apply suggestions from a model, doesn't update anything in the loop, and so on. So it's also super agnostic to what you're doing there – all it knows is that you want to show some data in the app.
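Stripped down, a recipe like that is barely more than a stream plus a view_id. Here's a minimal sketch of the idea – the recipe name mark-sketch is made up, and it assumes Prodigy's JSONL loader:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("mark-sketch")
def mark_sketch(dataset, source, view_id="classification", label=None):
    # Load the examples as-is – no tokenization, no model, no suggestions
    stream = JSONL(source)
    if label:
        # Attach the label so interfaces like "classification" can show it
        stream = ({**eg, "label": label} for eg in stream)
    return {
        "dataset": dataset,  # dataset the annotations are saved to
        "stream": stream,    # raw examples, passed straight through
        "view_id": view_id,  # interface to render, e.g. "classification"
    }
```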

The ner.teach recipe, on the other hand, is specifically designed for named entity recognition with a model in the loop. It expects a spaCy model that predicts named entities, gets the predictions, adds highlighted spans to the incoming examples and updates the model with the answers. You can see a slightly simplified version with explanations in our prodigy-recipes repo: https://github.com/explosion/prodigy-recipes
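The structure of that simplified version is roughly as follows – this is a sketch based on the repo version, with fewer options and shorter names:

```python
import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.ner import EntityRecognizer

@prodigy.recipe("ner.teach-sketch")
def ner_teach(dataset, spacy_model, source, label=None):
    nlp = spacy.load(spacy_model)               # model that predicts entities
    model = EntityRecognizer(nlp, label=label)  # annotation model wrapper
    stream = JSONL(source)
    # Score the stream and resort it to prefer uncertain predictions
    stream = prefer_uncertain(model(stream))
    return {
        "view_id": "ner",        # render the suggested entity spans
        "dataset": dataset,
        "stream": stream,
        "update": model.update,  # update the model in the loop with answers
    }
```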

The ner.manual recipe doesn't update anything in the loop, but it makes sure that the examples that come in are pre-tokenized and that existing annotated spans are aligned with the tokens. This allows faster highlighting, because the selection can "snap" to the token boundaries.
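A rough sketch of that, using the add_tokens preprocessor (the recipe name is again made up):

```python
import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe("ner.manual-sketch")
def ner_manual(dataset, spacy_model, source, label=None):
    nlp = spacy.load(spacy_model)  # only used for its tokenizer here
    stream = JSONL(source)
    # Add a "tokens" key to each example and align existing spans with the
    # tokens, so the selection can snap to token boundaries in the UI
    stream = add_tokens(nlp, stream)
    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "config": {"labels": label.split(",") if label else []},
    }
```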

To answer your initial question:

If you're labelling entities, you usually want to highlight spans of text within a text. If you're assigning top-level categories to a text, that usually doesn't require a specific mechanism: you don't need to pre-tokenize the text or do anything else specific to the task. How you solve it is also more flexible: you can either stream in each example once per label, or use the choice interface to select one or more categories at once (see the sketch below). None of these require custom, textcat-specific logic, which is why we currently don't have a dedicated recipe for that. But maybe it'd make sense to offer a slightly modified version of mark as textcat.manual, just for consistency.
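For instance, a choice-based sketch could look like this – textcat.manual-sketch is a hypothetical name, and the options are generated from a comma-separated labels argument:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("textcat.manual-sketch")
def textcat_manual(dataset, source, labels=""):
    label_list = [l.strip() for l in labels.split(",") if l.strip()]

    def add_options(stream):
        # Turn each plain text example into a multiple-choice task
        for eg in stream:
            eg["options"] = [{"id": label, "text": label} for label in label_list]
            yield eg

    return {
        "dataset": dataset,
        "stream": add_options(JSONL(source)),
        "view_id": "choice",                     # select categories in the UI
        "config": {"choice_style": "multiple"},  # allow multiple selections
    }
```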

The reason there's no --no-missing flag on textcat.batch-train is that in spaCy, categories were assumed to be not mutually exclusive by default – each category is predicted independently, so a text can have several labels at once. In the latest version of spaCy, you'll be able to customise this behaviour more easily, so we'll also be adding support for that to Prodigy.
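For example, with a trained text classifier (the model name below is hypothetical), each category gets its own independent score:

```python
import spacy

# Hypothetical pipeline with a trained text classifier
nlp = spacy.load("my_textcat_model")
doc = nlp("Some example text")
# Each category gets its own independent score, so the values don't have
# to sum to 1.0 and a text can plausibly belong to several categories
print(doc.cats)  # e.g. {"SPORTS": 0.83, "POLITICS": 0.41}
```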

Thanks a lot for the answer :slight_smile: So if I understand it correctly, ner.manual is a variation of mark with --view-id ner_manual that simply makes manual annotation easier, since it helps with highlighting the correct token boundaries?

> But maybe it'd make sense to offer a slightly modified version of mark as textcat.manual, just for consistency.

I guess that would help new people like me to find the correct recipe, because I just looked at textcat recipes and was a bit confused when I didn't find it there :smile:

> The reason there's no --no-missing flag on textcat.batch-train is that in spaCy, categories were assumed to be not mutually exclusive by default – each category is predicted independently, so a text can have several labels at once. In the latest version of spaCy, you'll be able to customise this behaviour more easily, so we'll also be adding support for that to Prodigy.

Oh, that makes sense now. I was actually wondering why the model returned a probability for each category and why they didn't sum up to 100%. But I see that it's more flexible that way. I'm looking forward to testing the newest version with Prodigy then :slight_smile:

Yes, you can see what ner.manual does under the hood in our prodigy-recipes repo as well: https://github.com/explosion/prodigy-recipes

It might also help to think of the mark recipe as kind of the most basic recipe: data comes in and is rendered with an interface. That's the minimum you need for any given recipe. More task-specific recipes can also implement other things: for example, data transformations, a model or other process that adds suggestions to the data, an update callback that's executed when new answers are received, and so on.
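Put differently, a custom recipe skeleton might look like this sketch – dataset, stream and view_id are the essentials, and everything else is optional extras:

```python
import prodigy

@prodigy.recipe("custom-sketch")
def custom_recipe(dataset):
    def get_stream():
        # The minimum: yield dicts the chosen interface knows how to render
        yield {"text": "An example to annotate"}

    def update(answers):
        # Optional: called with batches of answered examples – this is where
        # model-in-the-loop recipes like ner.teach update their model
        pass

    return {
        "dataset": dataset,      # where annotations are saved
        "stream": get_stream(),  # data comes in ...
        "view_id": "text",       # ... and is rendered with an interface
        "update": update,        # optional extras hook in here
    }
```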

Ok, thanks, I think I understand it better now :slight_smile:
