Classifying long-documents based on small spans of text

Let me begin with a quick compliment on all of your impressive work. I'm relatively new to NLP -- spaCy and Prodigy are really amazing tools. I'm still a newbie but definitely see how these can be of value to my projects. (And thanks, @Ines, for your quick reply to my earlier thread on working with custom models.) I'm hoping to get your input on a classification task.

I'm working with a large collection of clinical summaries. My ultimate goal is to identify all cases that involve an opioid-related problem. Each summary might contain 500-1,000 words, but the presence or absence of a problem really comes down to one or two sentences that use an opioid-related term. In other words, if the clinician uses an opioid-related term (e.g., "methadone", "heroin", "opioid"), they are almost always referring to a current opioid-related problem, but sometimes it's a negation or a mention of a historical problem. Opioid-related problems are relatively rare, so there are many more negative than positive cases (i.e., highly imbalanced categories).

I tested out different word embedding models but decided to train my own FastText model because of the domain-specific terminology and lots and lots of misspellings. I then used Prodigy to create a collection of opioid terms from the embedding. Prodigy was FANTASTIC for performing this task.

Now, I want to classify the documents as present/absent for an opioid problem. Does it make sense to first filter out all documents that do not contain any of the opioid-related terms? And, because a lot of the text is irrelevant, should I perform the classification on sentences that contain an opioid-related term? I would be very interested in annotating either sentences or the entire summaries using Prodigy. For example, could I feed Prodigy the entire document or individual sentences, and have Prodigy highlight all the opioid terms to facilitate document- or sentence-level annotation?

Hi, thanks so much for the kind words! Glad to hear that Prodigy has been useful so far :blush: This is also a very interesting and relevant use case, so definitely keep us updated on your progress!

That sounds like a good plan, yes, and it's definitely something I would try! If you have a reliable and reasonably accurate process to identify the opioid-related terms (terminology lists, named entities etc.), you can make the text classification task much more specific and potentially much more accurate because it doesn't also have to learn whether a text is even relevant in the first place. You just want to make sure that you apply the same selection process during annotation and at runtime. So when you process all of your summaries later on, your workflow would be: check if the doc is relevant (contains related terms), then check the doc.cats for the predicted HAS_PROBLEM score (or something like that).
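Put together, that runtime workflow could be sketched roughly like this. Note that the model path, the `HAS_PROBLEM` label and the term list are all placeholders for whatever your actual setup uses:

```python
# A minimal sketch of the runtime workflow -- the model path, label name
# and term list are assumptions, so adjust them to your setup
OPIOID_TERMS = {"heroin", "methadone", "opioid"}

def is_relevant(text):
    # Step 1: check if the summary mentions any related term at all
    lowered = text.lower()
    return any(term in lowered for term in OPIOID_TERMS)

def classify_summaries(texts, model_path="./opioid-model", threshold=0.5):
    import spacy  # imported here so the relevance check has no dependencies
    nlp = spacy.load(model_path)  # your trained text classifier
    # Step 2: only run the classifier on relevant summaries and read doc.cats
    for doc in nlp.pipe(t for t in texts if is_relevant(t)):
        yield doc.text, doc.cats.get("HAS_PROBLEM", 0.0) >= threshold
```

The important part is that the same `is_relevant` check is applied both when selecting examples for annotation and later at runtime.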

Yes, you could, for instance, use a simple custom recipe with the binary classification UI and stream in examples that contain a `"spans"` key describing the terms found in the example. You could also send out only examples that contain terms, so you're focusing on texts that are potentially relevant. See the docs for an example of the UI and JSON format.

Here's a simple example of how your stream logic could look: I've used spaCy's PhraseMatcher for matching the related terms. When you go through the texts, you can then check if a text is relevant (contains matches), add the matches as highlighted spans and send it out for annotation with the label, e.g. HAS_PROBLEM. For each example, you can then hit accept or reject.

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

YOUR_TERMS = ["heroin", "methadone", "opioid"]  # etc.

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("OPIOID", [nlp.make_doc(term) for term in YOUR_TERMS])

def make_stream(stream):
    # This expects a stream of examples like {"text": "..."}
    for doc in nlp.pipe(eg["text"] for eg in stream):
        # Check if the text is relevant, i.e. contains terms, and only
        # send it out for annotation if it does
        matches = matcher(doc)
        if matches:
            matched_spans = [doc[start:end] for _, start, end in matches]
            # Just in case you have overlapping matches
            matched_spans = filter_spans(matched_spans)
            # Convert the matched spans to the JSON format Prodigy expects
            spans = [
                {"start": span.start_char, "end": span.end_char, "label": "OPIOID"}
                for span in matched_spans
            ]
            # Generate example and send out for annotation
            yield {"text": doc.text, "spans": spans, "label": "HAS_PROBLEM"}

In a custom recipe, this could look like the following:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("textcat.custom")
def textcat_custom(dataset, source):
    # Usage: prodigy textcat.custom dataset_name file.jsonl -F
    stream = JSONL(source)  # or however else you want to load the data
    stream = make_stream(stream)  # function from above
    return {
        "dataset": dataset,  # dataset to save annotations to
        "view_id": "classification",  # UI to use
        "stream": stream,  # data to stream in
    }

(You could make this a lot fancier if you feel like it – for example, define some more recipe arguments so you can load in your terms from a file or pass in a label via the CLI.)
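For instance, a hypothetical `terms_file` recipe argument could be backed by a tiny loader like this (a sketch; the helper name and one-term-per-line file format are my own assumptions):

```python
def load_terms(path):
    # Read a newline-delimited terms file, skipping blank lines
    with open(path, encoding="utf8") as f:
        return [line.strip() for line in f if line.strip()]
```

The recipe could then call `matcher.add("OPIOID", [nlp.make_doc(t) for t in load_terms(terms_file)])` instead of hard-coding the term list.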


I have been poring through all the documentation and am hoping you can double-check my approach. The model I trained seems to be quite good -- perhaps too good, which always makes me concerned! The training of your insults classifier is conceptually similar to the problem I am solving, so I relied heavily on your example.

Step 1. Create custom word embeddings on my own data (~200,000 clinical summaries) using FastText. Use this model in Prodigy to bootstrap a terminology list to facilitate training.

Step 2. In some previous work, I had assistants manually annotate about 1,000 randomly selected summaries for different types of problems, including the presence or absence of opioid problems. This initial annotation did not use word embeddings or Prodigy -- just a manual read/review of each case. These manually annotated summaries were considered my "ground truth." I used the terminology list to flag all summaries that contained at least one opioid-related term.

Of the total summaries flagged, I captured 100% of the cases that were originally identified as opioid-positive. About 25% of the flagged cases were false positives, but that seems like a reasonable trade-off for narrowing the problem, right? (If I understood correctly, this follows your suggestion: "You could also only send examples for annotation that contain terms, so you're focusing only on texts that are potentially relevant.")
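As a sanity check on that trade-off, the term filter's recall and precision can be computed from the flagging counts. The numbers below are purely illustrative, not my actual data:

```python
def filter_metrics(true_pos, false_pos, false_neg):
    # Recall: fraction of truly positive summaries the term filter caught
    recall = true_pos / (true_pos + false_neg)
    # Precision: fraction of flagged summaries that were truly positive
    precision = true_pos / (true_pos + false_pos)
    return recall, precision

# Example: the filter catches all 60 true positives and flags 20 extras
recall, precision = filter_metrics(true_pos=60, false_pos=20, false_neg=0)
print(recall, precision)  # 1.0 0.75
```

So "100% capture with 25% false positives" corresponds to a filter with perfect recall and 0.75 precision, which is exactly the regime you want for a pre-filter.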

Step 3. I then developed the training data on my flagged summaries using my original vectors. Here is the command I used to spin up my server and annotate the full summaries:

!prodigy textcat.teach experiment2_opioids ./dispo_fasttext_vectors ./opioid_dispos.csv --loader csv --label OPIOID --patterns ./opioid_patterns_v2.jsonl

After working through all the training examples, I trained and evaluated the model with the following command:

!prodigy textcat.batch-train experiment2_opioids ./dispo_fasttext_vectors --output opioid-model --eval-split 0.2 

And, within a few seconds, I produced the following results. I also loaded my model and tested a few examples, and everything seems to be working quite well. But, as I mentioned, I am hoping to get a double-check before I proceed further. (Even if I made a mistake, I am VERY happy to be working with Prodigy -- this is really amazing software!)

Hi Brian,

Really glad to hear your experiments went well! This was a really nice write-up. I think you did almost everything right here, but I do think there could be a problem, depending on what your intention is at this point.

The evaluation figures that Prodigy is producing refer to a 20% split from your training data. However, your training data was developed over these flagged summaries, so it's not an unbiased sample of the overall summaries, right?

What you might do next is get another sample of summaries that reflects an unbiased sample of the texts you want your trained model to operate over. This way you can get an accuracy figure that reflects the whole task, so that you can say something like, "If I pick a sentence out of this pile of summaries, and my model says it's about opioids, how often is it wrong?". Your current evaluation doesn't quite tell you that -- it tells you, "If I pick a sentence out of a sample of flagged summaries, and the model says it's about opioids, how often is it wrong?"
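Once you have such an unbiased annotated sample, the evaluation could be scripted along these lines. This is only a sketch: the model path, label name and threshold are assumptions, and the metrics helper is a made-up name:

```python
def predict(model_path, texts, label="OPIOID", threshold=0.5):
    # Run the trained textcat model over raw texts from the unbiased sample
    import spacy  # imported here so the metrics helper stays dependency-free
    nlp = spacy.load(model_path)
    return [doc.cats.get(label, 0.0) >= threshold for doc in nlp.pipe(texts)]

def precision_recall(preds, golds):
    # Compare boolean predictions against the gold annotations
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(not p and g for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Because the sample is drawn at random from all summaries (not just flagged ones), the resulting precision and recall describe the whole pipeline, filter included.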