Binary "pre-model" for faster annotation


Is there a way I can train a binary accept/reject textcat model that finds relevant sentences (the relevant category that we are looking for is rare in the whole dataset)?
Could the model then propose accepts that I can pipe into ner.manual for more detailed annotation? I think, this way, I would be able to get more (relevant) annotations faster.

Something comparable was asked here: Span annotation with ner.manual -- how to make use of ner.teach

Thanks in advance!

That's a nice idea and should be possible! :slightly_smiling_face: Assuming you've trained your text classification model, you could write a custom recipe with a stream like this:

import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

# in your recipe:
nlp = spacy.load("./your_textcat_model")

def get_suggestions_from_textcat(source):
    stream = JSONL(source)
    for eg in stream:
        doc = nlp(eg["text"])
        # Use doc.cats and their scores to decide whether
        # you want to send out the example or not
        if doc.cats["RELEVANT"] > 0.5:
            yield eg

stream = get_suggestions_from_textcat(source)  # source = path to your input data
# Add "tokens" to each example so it can be rendered in the
# "ner_manual" interface
stream = add_tokens(nlp, stream)

In the example above, it's just checking whether the RELEVANT text category scored above 0.5. But depending on the text classifier you've trained, you could also come up with more sophisticated logic here. In the annotation UI, you'll then only see the texts you selected based on the text categories.
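If you want to experiment with the threshold or the label, it can help to factor the filtering into a small reusable helper. Here's a minimal sketch of that idea – the names `filter_relevant` and `score_fn` are just made up for illustration, and the dummy scorer stands in for calling your spaCy model (in a real recipe you'd pass something like `lambda text: nlp(text).cats`):

```python
def filter_relevant(stream, score_fn, label="RELEVANT", threshold=0.5):
    """Yield only examples whose predicted score for `label` clears the threshold."""
    for eg in stream:
        cats = score_fn(eg["text"])  # dict of category -> score
        score = cats.get(label, 0.0)
        if score >= threshold:
            # Keep the score in the task meta so it's visible in the UI
            eg.setdefault("meta", {})["score"] = round(score, 2)
            yield eg

# Dummy scorer for illustration only – replace with your textcat model
dummy = lambda text: {"RELEVANT": 0.9 if "error" in text else 0.1}
examples = [{"text": "error in pipeline"}, {"text": "all good"}]
print(list(filter_relevant(examples, dummy)))
# → [{'text': 'error in pipeline', 'meta': {'score': 0.9}}]
```

Storing the score in `eg["meta"]` is optional, but it's a nice way to sanity-check your classifier while you annotate, since the meta is displayed in the bottom right corner of the annotation card.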