Hi! I have a textcat model that has been trained on annotated sentences that were split out of larger docs. Is there a way to use the model so that it runs on each sentence of a doc, rather than on the whole doc? It doesn't seem to make sense to use the model directly, since I would be calling it twice (once for sentence segmentation and then on each sentence).
Yes, there are a couple of ways you could do that. It's pretty much a question of spaCy usage rather than Prodigy. Fundamentally, what you want to be doing is something like this:
import spacy

nlp = spacy.load("/path/to/model")
# First, we'll take the textcat out of the pipeline,
# so we can call it separately.
textcat = nlp.get_pipe("textcat")
nlp.disable_pipes(["textcat"])
# We run the remaining pipeline to get our Doc with sentences.
doc = nlp("Some text. A second sentence.")
# Now we want to just run the textcat, over the sentences.
# One way to do this is to convert the sentences into Doc objects.
docs = [sent.as_doc() for sent in doc.sents]
# Now we can pass those through the textcat.
docs = list(textcat.pipe(docs))
You can't really wrap this up as a pipeline component because the component needs to split the Doc, rather than just setting annotations. But you can still just put a function around the logic that you can call with the nlp object and a sequence of texts, and that should be convenient enough.
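For what it's worth, here is a rough sketch of what such a wrapper function might look like. The name classify_sentences and the textcat_name parameter are made up for illustration, and it assumes the textcat component doesn't rely on output from other components in the pipeline, so treat it as a starting point rather than a definitive implementation:

```python
def classify_sentences(nlp, texts, textcat_name="textcat"):
    """Return a list of (sentence_text, cats) pairs for all sentences in texts.

    classify_sentences and textcat_name are hypothetical names for this sketch.
    """
    # Grab the textcat before disabling it, so we can call it directly later.
    textcat = nlp.get_pipe(textcat_name)
    # Run the rest of the pipeline (including sentence segmentation)
    # with the textcat temporarily switched off.
    with nlp.disable_pipes(textcat_name):
        docs = list(nlp.pipe(texts))
    # Turn every sentence into its own Doc object.
    sent_docs = [sent.as_doc() for doc in docs for sent in doc.sents]
    # Run only the textcat over the sentence-level Docs.
    return [(sent_doc.text, sent_doc.cats) for sent_doc in textcat.pipe(sent_docs)]
```

You would then load the model once with spacy.load and call classify_sentences(nlp, ["Some text. A second sentence."]) to get one (text, scores) pair per sentence. Using disable_pipes as a context manager means the textcat is restored in the pipeline once the block exits.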