Apply textcat only to sentences of a document?

oneextrafact · June 13, 2020, 8:14pm

Hi! I have a textcat model that was trained on annotated sentences split out of larger docs. Is there a way to use the model so that it runs on each sentence of a doc, rather than on the doc as a whole? Calling the full pipeline directly doesn't seem to make sense, since I would be running it twice: once for sentence segmentation and then again on each sentence.

honnibal · June 16, 2020, 12:07am

Yes, there are a couple of ways you could do that. It's pretty much a question of spaCy usage rather than Prodigy. Fundamentally, what you want to be doing is something like this:


import spacy

nlp = spacy.load("/path/to/model")
# First, we'll take the textcat out of the pipeline,
# so we can call it separately.
textcat = nlp.get_pipe("textcat")
nlp.disable_pipes(["textcat"])

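# We run the remaining pipeline to get our Doc with sentences.
# (doc.sents needs a component that sets sentence boundaries, e.g. the parser.)
doc = nlp("Some text. A second sentence.")
# Now we want to just run the textcat, over the sentences.
# One way to do this is to convert the sentences into Doc objects.
docs = [sent.as_doc() for sent in doc.sents]
# Now we can pass those through the textcat.
docs = list(textcat.pipe(docs))
# Each sentence-level Doc now carries its predicted scores in doc.cats.
for sent_doc in docs:
    print(sent_doc.text, sent_doc.cats)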

You can't really wrap this up as a pipeline component because the component needs to split the Doc, rather than just setting annotations. But you can still just put a function around the logic that you can call with the nlp object and a sequence of texts, and that should be convenient enough.
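
For illustration, here is a rough sketch of what such a wrapper function might look like. The name `classify_sentences` and its return shape are made up for this example and are not part of spaCy or Prodigy. It assumes `nlp` is the loaded pipeline with the textcat already disabled and still able to set sentence boundaries, and that `textcat` is the component pulled out beforehand, as in the snippet above.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def classify_sentences(
    nlp, textcat, texts: Iterable[str]
) -> Iterator[List[Tuple[str, Dict[str, float]]]]:
    """Yield, for each input text, a list of (sentence_text, cats) pairs.

    Hypothetical helper: `nlp` is assumed to be a pipeline that sets
    sentence boundaries and has its textcat disabled; `textcat` is the
    text classifier component that was taken out of it.
    """
    for doc in nlp.pipe(texts):
        # Convert each sentence Span into a standalone Doc.
        sent_docs = [sent.as_doc() for sent in doc.sents]
        # Run only the text classifier over the sentence-level Docs
        # and collect each sentence's text with its category scores.
        yield [
            (sent_doc.text, dict(sent_doc.cats))
            for sent_doc in textcat.pipe(sent_docs)
        ]
```

You would call it after the same setup as before, something like `for results in classify_sentences(nlp, textcat, texts): ...`, where each `results` is the list of sentence/score pairs for one input text. Yielding one list per text keeps the sentences grouped by the document they came from.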
