Healthcare NER or Text Classification

amplacona · August 17, 2018, 9:21pm

Hello,

I'm new to the spaCy/Prodigy world, and the support here has really gotten me off to a great start.

That being said, I am trying to see whether I can use Prodigy in the healthcare space to evaluate organ donor text (e.g. drug use).

I have a large corpus and am looking to classify social risks like IV drug use.

Here is my workflow:

Obtain text: should I lemmatize it?
Train vectors: I have 15 million words, but only 250K unique ones

*Attempted classification with poor results

I noticed that the text's grammar, and the fact that each blurb contains many topics, are probably wreaking havoc on the word vectors. I think a few ways of addressing it are:

Break the blurbs into “sentences” and classify them
Realize that I may not have enough data for word vectors
Maybe “rules” based nlp might work better
Address the class imbalance that may be driving Prodigy toward the majority class by sampling a more balanced document set
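
As a rough illustration of the first and last ideas in the list above, here is a minimal sketch in plain Python. It is not Prodigy's or spaCy's own method: `split_sentences` and `downsample_majority` are hypothetical helper names, the splitter is a naive regex (a real pipeline would more likely use a proper sentence segmenter), and the labels in the usage note are made up for the example.

```python
import random
import re


def split_sentences(blurb):
    """Naively split a blurb on ., ! or ? followed by whitespace, or on newlines.

    Assumption: this is good enough for a first look. Abbreviations such as
    "Pt." will be split incorrectly.
    """
    parts = re.split(r"(?<=[.!?])\s+|\n+", blurb)
    return [p.strip() for p in parts if p.strip()]


def downsample_majority(examples, seed=0):
    """Randomly drop examples so every label keeps as many as the rarest label.

    `examples` is a list of (text, label) pairs.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced
```

For example, `split_sentences("Pt denies smoking. Hx of IV drug use.\nNo tattoos")` yields three short units that can be annotated separately, and `downsample_majority` applied to 2 positive and 10 negative examples keeps 2 of each. Downsampling throws data away, so it is only one of several ways to handle imbalance.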

I am hoping you all can be a sounding board.

-Andrew

honnibal · August 20, 2018, 12:08pm

Hi Andrew,

I think the first decision you'll have to make is how to map the task you want the program to achieve down into a series of annotations an ML model can predict. There are often multiple ways to do that. Lately I've been giving some talks about this aspect of machine learning, which I think is under-discussed:

I think a text classification approach will probably be best for your problem. However, this is not entirely certain.

On lemmatizing: no. Stripping the inflections will prevent the pre-trained models from working properly, as they assume the input is normal, running text.

You might not need to train your own vectors. Try working with the pre-trained GloVe vectors provided by spaCy first.

What did you do to attempt classification? How many examples did you annotate with the categories you're interested in?

amplacona · August 31, 2018, 6:59pm

Thanks for the video. It was really helpful :). I may look into the GloVe vectors; the challenge is that I am working directly with bio-medical notes.

After looking at how others clean up their text, I found that doing the same (case normalization and the like) helped the word vector model.

In terms of what I am trying to do: I am looking for explicit statements of certain health behaviors such as IV drug use. Believe it or not, it's not standardized in the text, and the grammar of bio-medical notes tends to be poor at best.
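
Since the target is explicit statements, a small rule-based baseline is one way to see how far patterns alone get. The sketch below is only an assumption about what such rules might look like: the phrase list, the negation cues, the 40-character look-back window, and the name `find_risk_mentions` are all hypothetical and would need review against the actual notes.

```python
import re

# Hypothetical surface forms for IV drug use; a real list would come from
# reading the notes, not from this example.
RISK = re.compile(
    r"\b(?:ivdu|ivda|(?:iv|intravenous)\s+drug\s+(?:use|user|abuse))\b",
    re.IGNORECASE,
)

# Crude negation check: a cue word with no sentence break between it and
# the end of the look-back window.
NEGATION = re.compile(
    r"\b(?:denies|denied|no|without|negative for)\b[^.;]*$",
    re.IGNORECASE,
)


def find_risk_mentions(text, window=40):
    """Return (matched phrase, negated?) pairs for each risk mention in text."""
    hits = []
    for match in RISK.finditer(text):
        prefix = text[max(0, match.start() - window):match.start()]
        negated = bool(NEGATION.search(prefix))
        hits.append((match.group(0), negated))
    return hits
```

Matches from rules like these could also be used to pre-select candidate sentences for annotation, rather than as the final classifier.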
