Classifying long documents based on small spans of text

Hi, thanks so much for the kind words! Glad to hear that Prodigy has been useful so far :blush: This is also a very interesting and relevant use case, so definitely keep us updated on your progress!

That sounds like a good plan, yes, and it's definitely something I would try! If you have a reliable and reasonably accurate process for identifying the opioid-related terms (terminology lists, named entities etc.), you can make the text classification task much more specific and potentially much more accurate, because the model doesn't also have to learn whether a text is even relevant in the first place. You just want to make sure you apply the same selection process during annotation and at runtime. So when you process all of your summaries later on, your workflow would be: check whether the doc is relevant (contains related terms), then check doc.cats for the predicted HAS_PROBLEM score (or something along those lines).
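Just to illustrate the runtime side, a sketch of that two-step check could look something like this (assuming you've trained a text classifier with a HAS_PROBLEM label and saved it out; the model path, terms and threshold are placeholders):

import spacy

# Placeholder path to your trained text classification pipeline
nlp = spacy.load("./textcat_model")
TERMS = ["heroin", "methadone", "opioid"]  # same terms you used during annotation

def has_problem(text, threshold=0.5):
    # Step 1: only score texts that are relevant, i.e. contain related terms
    if not any(term in text.lower() for term in TERMS):
        return False
    # Step 2: check the predicted score for the HAS_PROBLEM category
    doc = nlp(text)
    return doc.cats.get("HAS_PROBLEM", 0.0) >= threshold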

Yes, you could, for instance, use a simple custom recipe with the binary classification UI and stream in examples that contain a key "spans" describing the terms found in the example. You could also send out only those examples that contain terms, so you're focusing on texts that are potentially relevant. See here for an example of the UI and JSON format: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP
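For example, an incoming task could look roughly like this (the text and character offsets here are made up):

{
  "text": "Patient reports prior heroin use.",
  "label": "HAS_PROBLEM",
  "spans": [{"start": 22, "end": 28, "label": "OPIOID"}]
}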

Here's a simple example of how your stream logic could look. I've used spaCy's PhraseMatcher for matching the related terms. As you go through the texts, you can check whether a text is relevant (contains matches), add the matches as highlighted spans, and send it out for annotation with the label, e.g. HAS_PROBLEM. For each example, you can then hit accept or reject.

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

YOUR_TERMS = ["heroin", "methadone", "opioid"]  # etc.
LABEL = "HAS_PROBLEM"

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("OPIOID", [nlp.make_doc(term) for term in YOUR_TERMS])

def make_stream(stream):
    # This expects a stream of examples like {"text": "..."}
    for doc in nlp.pipe((eg["text"] for eg in stream)):
        # Check whether the text is relevant, i.e. contains terms, and only send it out if it does
        matches = matcher(doc)
        if matches:
            matched_spans = [doc[start:end] for _, start, end in matches]
            # Just in case you have overlapping matches
            matched_spans = filter_spans(matched_spans)
            # Convert the matched spans to the JSON format expected by the UI
            spans = [
                {"start": span.start_char, "end": span.end_char, "label": "OPIOID"}
                for span in matched_spans
            ]
            # Generate example and send out for annotation
            eg = {"text": doc.text, "spans": spans, "label": LABEL}
            yield eg

Used in a custom recipe, it could look like this:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("textcat.custom")
def textcat_custom(dataset, source):
    # Usage: prodigy textcat.custom dataset_name file.jsonl -F recipe.py
    stream = JSONL(source)  # or however else you want to load the data
    stream = make_stream(stream)  # function from above
    return {
        "dataset": dataset,  # dataset to save annotations to
        "view_id": "classification",  # UI to use
        "stream": stream,  # data to stream in
    }

(You could make this a lot fancier if you feel like it: for example, define some more recipe arguments so you can load in your terms from a file or pass in a label via the CLI.)
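For instance, a sketch of a fancier version could look something like this (the -t/-l argument names and the plain-text terms file format are just one way to do it, not something you have to follow):

import prodigy
from prodigy.components.loaders import JSONL
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

@prodigy.recipe(
    "textcat.custom",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file with texts to annotate", "positional", None, str),
    terms_file=("Text file with one term per line", "option", "t", str),
    label=("Category label to assign", "option", "l", str),
)
def textcat_custom(dataset, source, terms_file=None, label="HAS_PROBLEM"):
    # Usage: prodigy textcat.custom dataset_name file.jsonl -t terms.txt -l HAS_PROBLEM -F recipe.py
    terms = ["heroin", "methadone", "opioid"]  # default terms, as in the snippet above
    if terms_file is not None:
        with open(terms_file, encoding="utf8") as f:
            terms = [line.strip() for line in f if line.strip()]
    nlp = spacy.blank("en")
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    matcher.add("OPIOID", [nlp.make_doc(term) for term in terms])

    def get_stream(stream):
        for doc in nlp.pipe((eg["text"] for eg in stream)):
            matches = matcher(doc)
            if matches:
                matched_spans = filter_spans([doc[start:end] for _, start, end in matches])
                spans = [{"start": s.start_char, "end": s.end_char, "label": "OPIOID"}
                         for s in matched_spans]
                yield {"text": doc.text, "spans": spans, "label": label}

    return {
        "dataset": dataset,
        "view_id": "classification",
        "stream": get_stream(JSONL(source)),
    }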
