Let me begin with a quick compliment on all of your impressive work. I'm relatively new to NLP -- spaCy and Prodigy are really amazing tools. I'm still a newbie, but I definitely see how they can add value to my projects. (And thanks, @Ines, for your quick reply to my earlier thread on working with custom models.) I'm hoping to get your input on a classification task.
I'm working with a large collection of clinical summaries. My ultimate goal is to identify all cases that involve an opioid-related problem. Each summary might contain 500-1,000 words -- but the presence or absence of a problem really comes down to one or two sentences that use an opioid-related term. In other words, when a clinician uses an opioid-related term (e.g., "methadone", "heroin", "opioid"), they are almost always referring to a current opioid-related problem, but sometimes the mention is negated or refers to a historical problem. Opioid-related problems are relatively rare, so there are many more negative than positive cases (i.e., the classes are highly imbalanced).
I tested out different word embedding models but decided to train my own FastText model because of the domain-specific terminology and the sheer number of misspellings. I then used Prodigy to create a collection of opioid terms from the embeddings. Prodigy was FANTASTIC for performing this task.
Now I want to classify the documents as present/absent for an opioid-related problem. Does it make sense to first filter out all documents that do not contain any of the opioid-related terms? And, because so much of the text is irrelevant, should I perform the classification only on sentences that contain an opioid-related term? I would be very interested in annotating either sentences or entire summaries using Prodigy. For example, could I feed Prodigy either whole documents or individual sentences and have it highlight all the opioid terms to facilitate document- or sentence-level annotation?
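To make the filtering/highlighting idea concrete, here is a rough sketch of what I have in mind, assuming a simple regex match against the term list (the term list, the `OPIOID_TERM` label, and the example text are all hypothetical). My understanding is that Prodigy tasks are plain dicts with a `"text"` key and an optional `"spans"` list of character offsets, which the annotation interface renders as highlights -- please correct me if that's wrong.

```python
# Sketch: pre-filter summaries by an opioid term list and build
# Prodigy-style task dicts with the matches pre-highlighted.
# Term list, label name, and example text are hypothetical.
import re

OPIOID_TERMS = ["methadone", "heroin", "opioid", "oxycodone"]
TERM_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, OPIOID_TERMS)) + r")\b",
    re.IGNORECASE,
)

def contains_opioid_term(text):
    """Pre-filter: keep only summaries mentioning at least one term."""
    return TERM_RE.search(text) is not None

def make_prodigy_task(text):
    """Build a task dict with matched terms as pre-highlighted spans
    (assumed Prodigy task format: "text" plus a "spans" list)."""
    spans = [
        {"start": m.start(), "end": m.end(), "label": "OPIOID_TERM"}
        for m in TERM_RE.finditer(text)
    ]
    return {"text": text, "spans": spans}

task = make_prodigy_task(
    "Patient denies current heroin use; on methadone maintenance."
)
```

The same `make_prodigy_task` helper could be applied either to whole summaries or to individual sentences, so the document-level vs. sentence-level question just becomes a choice of what unit of text to feed in.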