Healthcare NER or Text Classification


I'm new to the spaCy/Prodigy world, and the support here has really gotten me off to a great start.

That being said, I am trying to see if I can leverage Prodigy in the healthcare space to evaluate organ donor text (e.g., drug use).

I have a large corpus, and I'm looking to classify social risks like IV drug use.

Here is my workflow:

  • Obtain text: should I lemmatize it?

  • Train vectors: I have 15 million words, but only 250K unique ones

  • Attempted classification, with poor results

I noticed that the grammar of the text is rough and that the blobs contain many topics, which is probably causing havoc for the word vectors. I think a few ways of addressing it are:

  1. Break the blurbs into “sentences” and classify them
  2. Realize that I may not have enough data for word vectors
  3. Maybe “rules” based nlp might work better
  4. Address the class imbalance, which may be driving Prodigy toward the majority class, by sampling a more balanced document set
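
For what it's worth, options 1 and 4 can be tried with very little code. The following is a rough sketch using only the Python standard library, not how spaCy or Prodigy does it. The function names, the `label` key, and the sample note are made up for illustration, and the regex splitter will stumble on clinical abbreviations such as "q.d.", so a proper sentence segmenter would likely do better.

```python
import random
import re


def split_sentences(blurb):
    """Naive splitter: break after ., ! or ? followed by whitespace, and on newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", blurb)
    return [p.strip() for p in parts if p.strip()]


def downsample(examples, label_key="label", seed=0):
    """Randomly trim every class down to the size of the rarest class."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical note text, invented for this example.
blurb = "Pt denies tobacco use. Hx of IV drug use per family.\nNo recent travel"
sentences = split_sentences(blurb)

# Hypothetical labelled sentences: five negatives, two positives.
examples = [{"text": str(i), "label": "NEG"} for i in range(5)]
examples += [{"text": "a", "label": "POS"}, {"text": "b", "label": "POS"}]
balanced = downsample(examples)
```

Downsampling throws data away, so it is only one way to handle the imbalance; it is shown here because it is the simplest to reason about.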

I am hoping you all can be a sounding board.


Hi Andrew,

I think the first decision you’ll have to make is how to map the task you want the program to achieve down into a series of annotations an ML model can predict. There are often multiple ways to do that. Lately I’ve been giving some talks about this aspect of machine learning, which I think is under-discussed:

I think a text classification approach will probably be best for your problem. However, this is not entirely certain.

No, I wouldn’t lemmatize. Stripping the inflections will prevent the pre-trained models from working properly, as they assume the input is normal, running text.

You might not need to train your own vectors. Try working with the pre-trained GloVe vectors provided by spaCy first.

What did you do to attempt classification? How many examples did you annotate with the categories you’re interested in?

Thanks for the video. It was really helpful :). I may look into the GloVe vectors… the challenge is that I am working directly with bio-medical notes.

After looking at how others clean up their text, I tried some of it (case normalization and the like), and that helped the word vector model.

In terms of what I am trying to do… I am looking for explicit statements of certain health behaviors such as IV drug use. Believe it or not, it’s not standardized in the text, and the grammar of bio-medical notes tends to be poor at best.
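
Since the target is explicit statements, a small set of rules could serve as a first pass to pull candidate text for annotation. Below is a minimal sketch with Python's `re` module. The surface forms in the pattern are guesses for illustration, not a validated clinical vocabulary, and it makes no attempt at negation, so "denies IV drug use" still matches.

```python
import re

# Hypothetical surface forms; a real list would come from reading the notes.
IV_DRUG_PATTERN = re.compile(
    r"\b(?:iv|intravenous)[\s-]+drug[\s-]+(?:use|abuse)r?\b"  # "IV drug use(r)", "intravenous drug abuse"
    r"|\bivd[ua]\b",                                           # abbreviations "IVDU", "IVDA"
    re.IGNORECASE,
)


def find_mentions(text):
    """Return (start, end, matched_text) for each candidate mention."""
    return [(m.start(), m.end(), m.group(0)) for m in IV_DRUG_PATTERN.finditer(text)]


# Hypothetical note fragments, invented for this example.
notes = [
    "Hx of IVDU, last use 2015.",
    "pt denies iv drug use",
    "IV fluids given in ED",
]
hits = [[mention[2] for mention in find_mentions(note)] for note in notes]
```

Matches from rules like these could also be used to pre-select examples for annotation, which would surface more of the rare positive class than random sampling.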