Access to/manipulate sent.cat within TextClassifier class?

I’m attempting to write a custom recipe for arbitrary sentence-level classification (think sentiment analysis, but on a per-sentence basis, nothing really controversial or exotic). I’ve been experimenting with other examples, and I know that such a thing can be achieved using custom pipeline components, though I’d like to use prodigy/spacy’s native integrations if possible to limit variation during the model development process.

The crux of the issue, as I understand it, is as follows:

  • prodigy/components/recipes/textcat.py looks like it should serve as a useful recipe template for a custom text classification task
  • prodigy/components/recipes/textcat.py manipulates an instance of TextClassifier, which appends arbitrary labels at the document level via the doc.cats attribute
  • Ok cool, so why not just tweak TextClassifier to append at the sentence level instead? As far as I can tell, however, TextClassifier is a Cython-compiled (probably butchering this…) thing?
  • Because of this, I have no obvious way of altering TextClassifier, as I lack the source code. Reverse engineering it is a time-intensive alternative, I suppose, but one I'd rather avoid.

I’ve found the following threads which are peripheral to my problem. They’ve proven useful, but ultimately do not do what I’d like:

Any pointers on this would be much appreciated!

Thanks

Hi,

I think it might be useful to separate two problems out:

  1. How to annotate and check progress for the sentence level task in Prodigy?
  2. How to train and use a per-sentence-classification model in spaCy?

I think the simplest answer for 1) is to split your data up so that you have a feed of single sentences. You can feed Prodigy with a generator function or by piping forward the output of another process, so it should be pretty easy to give yourself a feed with one sentence per example.
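
For instance, a minimal sketch of such a feed, assuming a JSONL source file with a "text" field and using spaCy's sentencizer for the splitting (the function name is just a placeholder):

import spacy
from prodigy.components.loaders import JSONL

def sentence_stream(source):
    # Split each incoming text into sentences and yield one task per sentence.
    nlp = spacy.blank('en')
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    for eg in JSONL(source):
        doc = nlp(eg['text'])
        for sent in doc.sents:
            yield {'text': sent.text}

You'd then return sentence_stream(source) as the 'stream' component from your recipe.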

For 2), the issue is that the doc.cats data expects the label to apply to the whole document. There’s no natural support for per-sentence classification. You’d probably want an extension attribute to handle this. You can store the labels in the doc.user_data dictionary, and then retrieve them with a custom getter on the Span object. This should let you write something like sent._.my_label. Happy to elaborate on this if it’s the path you want to go down.
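
Something like this, for instance (the my_label name and the offset-based user_data key are just one way to do it, not anything spaCy prescribes):

from spacy.tokens import Span

def get_my_label(span):
    # Look the label up in the parent doc's user_data, keyed by the
    # span's character offsets so each sentence gets its own entry.
    return span.doc.user_data.get(('my_label', span.start_char, span.end_char))

Span.set_extension('my_label', getter=get_my_label)

A pipeline component that assigns labels would then write doc.user_data[('my_label', sent.start_char, sent.end_char)] = label for each sentence, and sent._.my_label would pick it up.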

As for how to implement the pipeline component, you can either implement a wrapping object that calls into a TextCategorizer instance, or you can subclass. Here's a rough sketch of what the subclass might look like:


import spacy
from spacy.tokens import Doc

class SentenceCategorizer(spacy.pipeline.TextCategorizer):
    name = 'sentcat'

    def predict(self, docs):
        # Rebuild each sentence as its own Doc, so the model scores
        # sentences rather than whole documents.
        sentences = []
        for doc in docs:
            for sent in doc.sents:
                sentences.append(Doc(doc.vocab, words=[w.text for w in sent]))
        sent_scores = list(self.model(sentences))
        # Make a nested list, where element i is a list of docs[i]'s sentence scores.
        doc_sent_scores = []
        for doc in docs:
            doc_sent_scores.append([])
            for sent in doc.sents:
                doc_sent_scores[-1].append(sent_scores.pop(0))
        return doc_sent_scores, [doc.tensor for doc in docs]

    def set_annotations(self, docs, scores, tensors=None):
        # Store the per-sentence scores on the doc instead of doc.cats.
        for doc, sent_scores in zip(docs, scores):
            doc.user_data['my_sent_scores'] = sent_scores
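
You'd then wire it into the pipeline like any other component, roughly like this (the label is just an example, and the model still needs training before the scores mean anything):

import spacy

nlp = spacy.load('en_core_web_sm')  # the parser provides doc.sents
sentcat = SentenceCategorizer(nlp.vocab)
sentcat.add_label('POSITIVE')
nlp.add_pipe(sentcat, last=True)
# After nlp(text), the per-sentence scores are available via
# doc.user_data['my_sent_scores']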

In summary: we’ve tried to make sure that the compiled parts of Prodigy shouldn’t matter, because they only affect the actual annotation service. Once the data is annotated, you can train and run your models however you like, whether with spaCy (which is open-source), with your own code, or with something third-party. We also provide the code for the recipe scripts that feed the annotation server, so it’s easy to do data transforms on the way into Prodigy, and it’s easy to manipulate the annotated data (which is just JSONL files) after the annotations are complete.
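
For example, after you export a dataset with prodigy db-out, post-processing the annotations is just a matter of reading the file line by line (the file name here is arbitrary):

import json

with open('annotations.jsonl', encoding='utf8') as f:
    examples = [json.loads(line) for line in f]

# Keep only the examples the annotator accepted.
accepted = [eg for eg in examples if eg.get('answer') == 'accept']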

Hope that helps!


Alright! Thanks for the quick response.

  1. Completely agree about this, and I think the mechanism is nice and clear. There are some great examples of this via the custom recipes documentation (which I used to get to where I’m at now, so thank you!).

  2. Organizing arbitrary labels via an extension attribute is currently what I’m doing, via a custom spacy pipeline component. These custom components don’t currently inherit from spacy.pipeline.TextCategorizer as you’ve sketched; they’re their own generic classes, similar to what’s in the custom pipeline docs. I’ve found this approach works nicely for prototyping custom functionality, but it breaks down when I attempt to couple it with prodigy. I’ll have a crack with your SentenceCategorizer sketch in hand and see how I go integrating TextCategorizer-based pipeline components! Thanks a bunch for taking the time.

I’m still a bit unclear about some of the distinctions between prodigy and spacy, specifically about the prodigy.models.textcat.TextClassifier and the spacy.pipeline.TextCategorizer classes. Any chance you could clarify the relation between them?

  1. Transparency is one of the most compelling features of prodigy, and I agree with everything you said; it’s nice to have options, etc.

In hindsight, we should probably have chosen better names for Prodigy's built-in annotation models. Their main purpose is to perform scoring and updating for annotating with a model in the loop. For text classification, this is a bit more straightforward, because we only need to predict one label over the whole text and update with a definitive answer on that label (label applies or not). We've also implemented some tweaks to better handle long texts vs. short texts.

For NER, the annotation model is a bit more complex, because we need to score all possible analyses of the text (beam) and then update with incomplete, binary information (see my slides here for some background if you're interested). All of this is orchestrated within the annotation model – but just like the text classification annotation model, it expects to take a regular spaCy nlp object and uses that for scoring and updating.

Btw, if you haven't seen it yet, you might find this example recipe useful, which shows a textcat.teach-like annotation workflow with a dummy model, to illustrate how it interacts with a model in the loop:
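
The basic shape of that recipe looks roughly like this; DummyModel is a stand-in for whatever does the actual scoring and updating:

import random
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

class DummyModel(object):
    # Stand-in for a real model: assigns a score to each example and
    # receives the annotated answers back for updating.
    def __call__(self, stream):
        for eg in stream:
            eg['label'] = 'MY_LABEL'
            yield (random.random(), eg)  # sorters expect (score, example)

    def update(self, answers):
        pass  # update the model weights with the annotated examples

@prodigy.recipe('textcat.dummy')
def textcat_dummy(dataset, source):
    model = DummyModel()
    stream = prefer_uncertain(model(JSONL(source)))
    return {
        'dataset': dataset,
        'stream': stream,
        'update': model.update,
        'view_id': 'classification'
    }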


I think my intuition about their function was largely as you’ve just explained, and I only had a small amount of uncertainty about their similarities. That makes good sense!

That recipe is fantastic! I was looking around for something of the sort, thanks for sharing. To that end (and I’m not sure if I can cross-reference a spacy github discussion as well, though I think it might be useful for elaborating on my broader intent), I’ve recently asked for some help with wrapping pytorch models as thinc models, and using those models within spacy’s TextCategorizer class. The discussion can be found here.

I hope you can see where I’m going with this (using custom, pytorch-wrapped thinc models in a semi-supervised way with both prodigy and spacy), and perhaps you might have some thoughts on such a workflow. Prodigy-wise, I think re-purposing the recipe you just linked might allow me to use such a model, with the spacy teething issues being the last piece of the puzzle!
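
For anyone who finds this later, the wrapping step I’m referring to looks roughly like this, if I have the import path right for thinc v6/v7 (the network itself is just a placeholder):

import torch.nn as nn
from thinc.extra.wrappers import PyTorchWrapper

# Any torch.nn.Module should work the same way; this tiny classifier
# is just an illustration.
pytorch_model = nn.Sequential(nn.Linear(300, 64), nn.ReLU(), nn.Linear(64, 2))
model = PyTorchWrapper(pytorch_model)
# The wrapped object exposes thinc's usual API: model(X) for the
# forward pass, model.begin_update(X) for training.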