Apply textcat only to sentences of a document?

Hi! I have a textcat model that has been trained on annotated sentences that were split out of larger docs. Is there a way to use the model so that it runs on each sentence of a doc, rather than on the doc as a whole? It doesn't seem to make sense to use the model directly, since I would be running the pipeline twice (once for sentence segmentation and then again on each sentence).

Yes, there are a couple of ways you could do that. It's pretty much a question of spaCy usage rather than Prodigy. Fundamentally, what you want to be doing is something like this:


import spacy

nlp = spacy.load("/path/to/model")
# First, we'll take the textcat out of the pipeline,
# so we can call it separately.
textcat = nlp.get_pipe("textcat")
nlp.disable_pipes(["textcat"])

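# We run the remaining pipeline to get our Doc with sentences.
# (This needs a parser or sentencizer in the pipeline to set sentence boundaries.)
doc = nlp("Some text. A second sentence.")
# Now we want to just run the textcat, over the sentences.
# One way to do this is to convert the sentences into Doc objects.
docs = [sent.as_doc() for sent in doc.sents]
# Now we can pass those through the textcat.
docs = list(textcat.pipe(docs))
# Each sentence Doc now carries its category scores in doc.cats.
for sent_doc in docs:
    print(sent_doc.text, sent_doc.cats)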

You can't really wrap this up as a pipeline component, because a component is expected to set annotations on the Doc it receives, whereas this logic needs to split the Doc into several. But you can still just put a function around the logic that you can call with the nlp object and a sequence of texts, and that should be convenient enough.
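
A rough sketch of what such a wrapper function could look like is below. The name classify_sentences and the textcat_name parameter are made up for illustration, not part of spaCy. It assumes the loaded pipeline has a component called "textcat" and something that sets sentence boundaries (a parser or sentencizer), and it uses disable_pipes as a context manager so the textcat is switched back on afterwards.

```python
def classify_sentences(nlp, texts, textcat_name="textcat"):
    """Yield (sentence_text, cats) for every sentence in every text.

    Hypothetical helper: `nlp` is a loaded spaCy pipeline that contains a
    text classifier named `textcat_name` plus a component that sets
    sentence boundaries.
    """
    textcat = nlp.get_pipe(textcat_name)
    # Run everything except the textcat to get Docs with sentences.
    # The component is re-enabled when the with-block exits.
    with nlp.disable_pipes(textcat_name):
        docs = list(nlp.pipe(texts))
    for doc in docs:
        # Turn each sentence span into its own Doc.
        sent_docs = [sent.as_doc() for sent in doc.sents]
        # Run only the textcat over the sentence-level Docs.
        for sent_doc in textcat.pipe(sent_docs):
            yield sent_doc.text, sent_doc.cats
```

You would then call it along the lines of `for text, cats in classify_sentences(nlp, ["Some text. A second sentence."]): print(text, cats)`. Since it's a generator, the sentences are only scored as you iterate over the results.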