Targeted Sentiment Analysis - Custom Prodigy Recipe

Hi gurus,

I'm developing an annotation workflow for targeted sentiment analysis.

At a high level, this consists of 2 steps:

  1. perform NER on a document to find entities of interest (e.g., "chest pain", "dyspnea")
  2. perform text classification with respect to each entity detected in step 1 (e.g., "Pt has chest pain but no dyspnea" yields the following: CHEST_PAIN=PRESENT, DYSPNEA=ABSENT)

Step 1 is super easy with prodigy's out-of-the-box ner recipes (thanks for making it so easy!)
Step 2 requires some custom recipe development...

I've developed a custom recipe based on textcat.manual, overriding the "stream" key with an iterator that yields one object per entity per document (rather than one object per document). Basically, something like this:

obj = textcat.manual(dataset, source, loader, label, exclusive, exclude)  # call textcat.manual to build the base components
obj['stream'] = partition_by_span(obj['stream'])  # then override its stream

The partition_by_span function yields multiple copies of each document, each copy containing only one span in the "spans" key. Something like this:

from copy import deepcopy
from prodigy import set_hashes

def partition_by_span(obj_iter):
    for obj in obj_iter:
        for i, span in enumerate(obj['spans']):
            newobj = deepcopy(obj)
            newobj['spans'] = [span]  # only the span we're iterating on: highlights the entity in the Prodigy UI and tells the annotator which entity to consider for sentiment
            newobj['span_index'] = i  # record the span's position so set_hashes below can use it
            del newobj['_task_hash'], newobj['_input_hash']  # delete old hashes, since we're about to change them
            newobj = set_hashes(newobj, input_keys=("text", "span_index"))  # rehash on text + span_index (otherwise entities in the same document would be deduplicated)
            yield newobj

This means that I'll see two annotation opportunities for the document "Pt has chest pain but no dyspnea":

  • "Pt has chest pain but no dyspnea" (with "chest pain" highlighted)
  • "Pt has chest pain but no dyspnea" (with "dyspnea" highlighted)

This works pretty well, but now I have a bunch of annotated "documents" that are actually entities masquerading as documents. I'd like to stitch them back together and put each entity's sentiment into the appropriate spans array of the parent document where it belongs.

Is there a way to re-aggregate my partitioned spans/entities within my custom recipe? Or do I have to export my collection of annotated single-entity-documents to jsonl and stitch them back together with a separate python script?


Oh, I see I've basically re-implemented split_spans. :slight_smile:

In any case, what I'm looking to do is use split_spans, then annotate the sentiment of each span, then create a join_spans function to pull all the annotated spans back into the original document.
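In sketch form, join_spans would do something like the following (simplified: this version groups by text and copies the accepted PRESENT/ABSENT label onto each span; the "sentiment" field name and the grouping key are my own choices, not anything from the Prodigy API):

```python
def join_spans(annotated_tasks):
    """Rough inverse of split_spans: regroup single-span tasks into one
    document per original text, pooling all of their spans."""
    docs = {}
    for task in annotated_tasks:
        # Group by the original text; a stored parent hash would be more
        # robust if the same text occurs in several source documents.
        doc = docs.setdefault(task["text"], {"text": task["text"], "spans": []})
        for span in task.get("spans", []):
            span = dict(span)
            # Copy the accepted sentiment (e.g. ["PRESENT"]) onto the span
            # so the rebuilt parent document is self-contained.
            span["sentiment"] = task.get("accept", [])
            doc["spans"].append(span)
    return list(docs.values())
```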

I've posted some code over at github that does this (although I had to use some egregious hacks to get it done).

Question 1: Am I approaching this problem correctly? It seems more correct to associate the sentiment of each span in the original document. The alternative would be to use split_spans and then keep them split, but then the actual document is spread out across multiple pseudo-documents, which will make the annotated set difficult to use in a downstream task.

Question 2: You'll see in my code that I'm capturing the split span annotations in update, re-joining them using my join_spans function, saving them in a global variable (!!!), then returning those re-joined documents in before_db. I definitely feel like I'm working against the API -- can you recommend a more elegant approach?
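For concreteness, the shape of what I'm doing is roughly this (heavily simplified: no @prodigy.recipe decorator or dataset wiring, and rejoin is just a placeholder; the only point is that a closure over a local buffer could replace my global):

```python
def sentiment_recipe(stream):
    """Simplified recipe body: the buffer is captured by both callbacks
    as a closure, so no module-level global is needed."""
    buffered = []

    def update(answers):
        # Prodigy calls this with each batch of answers from the web app.
        buffered.extend(answers)

    def before_db(examples):
        # Called just before examples are written to the database: swap in
        # the re-joined parent documents instead of the split tasks.
        return rejoin(buffered)

    def rejoin(tasks):
        # Placeholder for a join_spans-style merge: group by text, pool spans.
        docs = {}
        for t in tasks:
            doc = docs.setdefault(t["text"], {"text": t["text"], "spans": []})
            doc["spans"].extend(t.get("spans", []))
        return list(docs.values())

    return {"stream": stream, "update": update, "before_db": before_db,
            "view_id": "classification"}
```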


I think a custom Python script is probably a good idea for the joining, because you'll be able to check the integrity and handle it in your own way. The logic is kind of simple, so I personally prefer to own those bits of processing when I'm doing these things.

Question 1: Am I approaching this problem correctly? It seems more correct to associate the sentiment of each span in the original document.

I think your plan seems reasonable, but perhaps you can just intersect per-sentence multi-label text classification with your entity annotations?

It seems unlikely that you'll have both CHEST_PAIN=PRESENT and CHEST_PAIN=ABSENT in the same sentence, right? If so, you could just try to predict {SYMPTOM}_{PRESENT, ABSENT} as classes over each sentence. So if you have, say, 100 symptoms, you'd have 200 classes predicted by your text classifier, and you'd still have your NER system to recover the relevant spans. You could have a rule that discards predictions for symptoms the NER doesn't recognise, if you want.
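In sketch form (the symptom list and the helper here are illustrative only, not part of any Prodigy or spaCy API):

```python
# Build {SYMPTOM}_{PRESENT,ABSENT} classes, then keep only the textcat
# predictions whose symptom the NER also recognised in that sentence.
SYMPTOMS = ["CHEST_PAIN", "DYSPNEA"]  # would be ~100 in practice
LABELS = [f"{s}_{v}" for s in SYMPTOMS for v in ("PRESENT", "ABSENT")]

def intersect(textcat_preds, ner_symptoms, threshold=0.5):
    """Keep predicted classes whose symptom the NER also found."""
    kept = {}
    for label, score in textcat_preds.items():
        symptom = label.rsplit("_", 1)[0]  # strip the _PRESENT/_ABSENT suffix
        if score >= threshold and symptom in ner_symptoms:
            kept[label] = score
    return kept
```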

Of course, if you don't need the actual span anchors, you could just use the text classification approach, and avoid the NER step. But it might make the output harder to work with for a human, because you'd have more trouble associating the predictions to specific words.

Thanks for your reply, Matthew.

I'll take a closer look at the approach you recommend (i.e., splitting and rehashing the documents in a prodigy recipe, but merging them back together outside of a prodigy session). It seems like prodigy.models.ner.merge_spans is close to what I need, although I'll probably need to roll my own since merge_spans only merges the document's answer into the span (it doesn't also merge the accept key, which contains the PRESENT/ABSENT label). The downside with this approach is that it's difficult for Prodigy to see which documents have already been annotated -- since they get split/rehashed before getting saved to the database, the original documents always seem "new".

The downside of the other approach (i.e., splitting, annotating, and remerging all in the same prodigy session) is that partially-annotated documents will be flagged as complete, even if there were some unannotated entities left at the end of the document. So there are some trade-offs.
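One thing that might make the first approach more tractable: stash the parent document's original _input_hash on each split task before rehashing, then group on that key when merging outside the session. A sketch (the parent_hash field name is my own invention, not a Prodigy convention):

```python
def tag_parent(obj_iter):
    """Variant of the splitting step that records which parent document
    each single-span task came from."""
    for obj in obj_iter:
        parent = obj.get("_input_hash")
        for i, span in enumerate(obj.get("spans", [])):
            task = {k: v for k, v in obj.items()
                    if k not in ("_input_hash", "_task_hash", "spans")}
            task["spans"] = [span]
            task["span_index"] = i
            task["parent_hash"] = parent  # remember the original document
            yield task  # (a real version would re-run set_hashes here)

def group_by_parent(tasks):
    """Group annotated single-span tasks back under their parent hash."""
    groups = {}
    for task in tasks:
        groups.setdefault(task["parent_hash"], []).append(task)
    return groups
```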

Also, thank you, I hadn't considered approaching this from a text classification perspective -- really interesting idea. I agree that most sentences won't have CHEST_PAIN_ABSENT and CHEST_PAIN_PRESENT, although plenty will simultaneously have something like CHEST_PAIN_ABSENT and DYSPNEA_PRESENT.

Thanks again for a great tool -- really enjoy the UI and recipe API!