Hi,
I'm a bit confused about the best approach to manually annotating documents for text classification. For NER I used ner.manual to manually annotate examples and then trained with --no-missing to make sure that all other tokens are not considered entities. But I couldn't find a corresponding recipe for textcat. What I did instead was use the mark recipe with --view-id classification. There was no --no-missing flag for textcat.batch-train though. My question is: why is there no textcat.manual and no --no-missing option for textcat.batch-train? I also noticed I could do NER with the mark recipe as well, by choosing --view-id ner or ner_manual. What's the difference between using the mark recipe with either ner or ner_manual vs. the ner.teach and ner.manual recipes?
The mark recipe takes whatever comes in and will render it with a given interface – that's it. It doesn't preprocess the text in any way, doesn't apply suggestions from a model, doesn't update anything in the loop etc. So it's also super agnostic to what you're doing there – all it knows is that you want to show some data in the app.
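To make that concrete, here's a rough sketch of what a mark-style custom recipe could look like: load a stream and hand it straight to an interface, nothing else. The recipe name and arguments here are made up for illustration, not the built-in mark signature.

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "mark-sketch",  # hypothetical recipe name, not the built-in mark
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    view_id=("Annotation interface to use", "option", "v", str),
)
def mark_sketch(dataset, source, view_id="classification"):
    stream = JSONL(source)          # pass the incoming examples through untouched
    return {
        "dataset": dataset,         # where the answers are saved
        "stream": stream,           # examples to render in the app
        "view_id": view_id,         # interface, e.g. "classification" or "ner"
    }
```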
The ner.teach recipe, on the other hand, is specifically designed for named entity recognition with a model in the loop. It expects a spaCy model that predicts named entities, gets the predictions, adds highlighted spans to the incoming examples and updates the model with the answers. You can see a slightly simplified version with explanations in our prodigy-recipes repo:
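Roughly, a teach-style recipe with a model in the loop could look like this. This is a sketch along the lines of the simplified example, with abbreviated names and arguments, not the exact built-in code.

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.ner import EntityRecognizer

@prodigy.recipe(
    "ner-teach-sketch",  # hypothetical name, not the built-in ner.teach
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model with an NER component", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    label=("Comma-separated labels to annotate", "option", "l", str),
)
def ner_teach_sketch(dataset, spacy_model, source, label=None):
    nlp = spacy.load(spacy_model)
    labels = label.split(",") if label else None
    model = EntityRecognizer(nlp, label=labels)   # wraps the NER model for scoring/updating
    stream = JSONL(source)                        # raw examples: {"text": "..."}
    # model(stream) yields (score, example) pairs with suggested spans;
    # prefer_uncertain prioritises the examples the model is least sure about
    stream = prefer_uncertain(model(stream))
    return {
        "dataset": dataset,
        "stream": stream,
        "update": model.update,                   # update the model with the answers
        "view_id": "ner",
    }
```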
The ner.manual recipe doesn't update anything in the loop, but it makes sure that the examples that come in are pre-tokenized and that existing annotated spans are aligned with the tokens. This allows faster highlighting, because the selection can "snap" to the token boundaries.
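For example, a manual-style recipe could use the add_tokens preprocessor to do that pre-tokenization. Again, the recipe name and arguments below are just a sketch, not the built-in ner.manual.

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe(
    "ner-manual-sketch",  # hypothetical name, not the built-in ner.manual
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Model providing the tokenizer", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    label=("Comma-separated labels to annotate", "option", "l", str),
)
def ner_manual_sketch(dataset, spacy_model, source, label=""):
    nlp = spacy.load(spacy_model)        # only the tokenizer is really needed here
    stream = JSONL(source)
    stream = add_tokens(nlp, stream)     # adds a "tokens" key so spans can snap to boundaries
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": label.split(",")},
    }
```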
To answer your initial question:
If you're labelling entities, you usually want to highlight spans within a text. If you're assigning top-level categories to a text, that usually doesn't require a specific mechanism: you don't need to pre-tokenize the text or do anything else specific to the task. How you solve it is also more flexible: you can either stream in each example for a given label, or use the choice interface to select one or more categories at once. None of these require custom, textcat-specific logic, which is why we currently don't have a dedicated recipe for that. But maybe it'd make sense to offer a slightly modified version of mark as textcat.manual, just for consistency.
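For instance, a textcat.manual-style recipe using the choice interface could look roughly like this. The recipe name and arguments are hypothetical; the idea is just to attach the labels as selectable options to each incoming task.

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "textcat-manual-sketch",  # hypothetical name, not a built-in recipe
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL source file", "positional", None, str),
    label=("Comma-separated category labels", "option", "l", str),
)
def textcat_manual_sketch(dataset, source, label=""):
    labels = label.split(",")

    def add_options(stream):
        # attach the category labels as selectable options to each task
        for eg in stream:
            eg["options"] = [{"id": lbl, "text": lbl} for lbl in labels]
            yield eg

    return {
        "dataset": dataset,
        "stream": add_options(JSONL(source)),
        "view_id": "choice",
        # allow selecting multiple categories; drop this for single-choice
        "config": {"choice_style": "multiple"},
    }
```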
The reason there's no --no-missing flag on textcat.batch-train is that in spaCy, categories were assumed to be not mutually exclusive by default. In the latest version of spaCy, you'll be able to customise this behaviour more easily, so we'll also be adding support for that to Prodigy.
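For instance, assuming the spaCy v2.1-style API, that behaviour is exposed via the textcat component's config, roughly like this:

```python
import spacy

nlp = spacy.blank("en")
# "exclusive_classes": True makes the categories mutually exclusive,
# so the predicted probabilities sum to 1; False scores each label independently
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True})
nlp.add_pipe(textcat)
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
```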
Thanks a lot for the answer! So if I understand it correctly, ner.manual is a variation of mark with --view-id ner_manual that simply makes the task of manual annotation easier, since it helps with highlighting the correct token boundaries?
But maybe it'd make sense to offer a slightly modified version of mark as textcat.manual, just for consistency.
I guess that would help new people like me find the correct recipe, because I just looked at the textcat recipes and was a bit confused when I didn't find it there.
The reason there's no --no-missing flag on textcat.batch-train is that in spaCy, categories were assumed to be not mutually exclusive by default. In the latest version of spaCy, you'll be able to customise this behaviour more easily, so we'll also be adding support for that to Prodigy.
Oh, that makes sense now. I was actually wondering why the model returned a probability for each category and why they didn't sum up to 100%. But I see that it's more flexible that way. I'm looking forward to testing the newest version with Prodigy then.
Yes, you can see what ner.manual does under the hood here:
It might also help to think of the mark recipe as kind of the most basic recipe: data comes in and is rendered with an interface. That's the minimum you need for any given recipe. More task-specific recipes can also implement other things: for example, data transformations, a model or other process that adds suggestions to the data, an update callback that's executed when new answers are received, and so on.
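As a rough sketch, the components a recipe returns could look like this: the basic stream and view_id, plus optional hooks like update and on_exit. The recipe name and the placeholder data here are made up for illustration.

```python
import prodigy

@prodigy.recipe(
    "components-sketch",  # hypothetical name, just to show the returned components
    dataset=("Dataset to save annotations to", "positional", None, str),
)
def components_sketch(dataset):
    # placeholder stream; a real recipe would load data and maybe transform it
    stream = ({"text": text} for text in ["first example", "second example"])

    def update(answers):
        # called with batches of answered tasks, e.g. to update a model in the loop
        print(f"received {len(answers)} answers")

    def on_exit(controller):
        # called once when the Prodigy server stops
        print("annotation session ended")

    return {
        "dataset": dataset,
        "stream": stream,
        "update": update,        # optional
        "on_exit": on_exit,      # optional
        "view_id": "classification",
    }
```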
Ok, thanks, I think I understand it better now