I'm new to the spaCy/Prodigy world, and the support here has gotten me off to a great start.
That said, I am trying to see whether I can leverage Prodigy in the healthcare space to evaluate organ donor text (e.g. for drug use).
I have a large corpus and am looking to classify social risks such as IV drug use.
Here is my workflow:
- Obtain text: should I lemmatize it?
- Train vectors: I have 15 million words, but only 250K unique ones
- Attempted classification, with poor results
I noticed that the text's grammar and the blurbs, which contain many topics, are probably wreaking havoc on the word vectors. I think a few ways of addressing this are:
- Break the blurbs into “sentences” and classify them
- Realize that I may not have enough data for word vectors
- Maybe “rules”-based NLP might work better
- Address the class imbalance that may be driving Prodigy toward the majority class by sampling a potentially more balanced document set
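To make the first, third and fourth ideas concrete, here is a rough sketch of what I have in mind. It is plain Python with regular expressions, not spaCy or Prodigy code. The example blurb, the patterns in `IV_DRUG_PATTERNS`, and all of the function names are made up for illustration, and the sentence splitter is deliberately naive.

```python
import random
import re

# Hypothetical rule patterns for IV drug use (illustrative only; a real
# list would need to come from domain experts and the actual corpus).
IV_DRUG_PATTERNS = [
    re.compile(r"\b(iv|intravenous)\s+drug\s+(use|abuse)\b", re.IGNORECASE),
    re.compile(r"\binject(s|ed|ing)?\s+(heroin|meth|cocaine)\b", re.IGNORECASE),
]


def split_sentences(blurb):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Abbreviations such as 'Dr.' will be split incorrectly.
    """
    parts = re.split(r"(?<=[.!?])\s+", blurb.strip())
    return [p for p in parts if p]


def flag_iv_drug_use(sentence):
    """Return True if any rule pattern matches the sentence."""
    return any(p.search(sentence) for p in IV_DRUG_PATTERNS)


def downsample_majority(examples, seed=0):
    """Sample every class in (text, label) pairs down to the smallest class size."""
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Made-up blurb, not real donor data.
blurb = "Donor lived alone. Family reports IV drug use in 2015. No recent travel."
sentences = split_sentences(blurb)
labeled = [(s, flag_iv_drug_use(s)) for s in sentences]
balanced = downsample_majority(labeled)
```

The idea would be to use the rule matches either as a baseline on their own or as a way to pre-select sentences for annotation, and to train on the balanced set rather than the raw one.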
I am hoping you all can be a sounding board.