I think the first decision you'll have to make is how to map the task you want the program to achieve down into a series of annotations an ML model can predict. There are often multiple ways to do that. Lately I've been giving some talks about this aspect of machine learning, which I think is under-discussed:
I think a text classification approach will probably be best for your problem. However, this is not entirely certain.
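To make the text classification framing concrete, here is a minimal sketch of one possible setup: each note is split into sentences, and each sentence becomes one yes/no annotation question for a single category. The label name `IV_DRUG_USE`, the naive regex sentence splitter, the helper names, and the two example notes are all made up for illustration; they are assumptions, not part of the original question or of any library.

```python
import json
import re

# Hypothetical label name, chosen only for this illustration.
LABEL = "IV_DRUG_USE"


def split_sentences(note):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Clinical notes are messy, so a real project would likely need
    something more robust than this.
    """
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [p for p in parts if p]


def make_tasks(notes, label=LABEL):
    """Turn each sentence into one binary question: does `label` apply?"""
    tasks = []
    for note in notes:
        for sent in split_sentences(note):
            tasks.append({"text": sent, "label": label})
    return tasks


def to_jsonl(tasks):
    """Serialize tasks as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(task) for task in tasks)


# Invented example notes, not real data.
notes = [
    "Pt denies tobacco. Reports prior IV drug use.",
    "No acute distress",
]
tasks = make_tasks(notes)
print(to_jsonl(tasks))
```

Each line of the output is one sentence paired with the label, which an annotator can accept or reject. Whether sentences, paragraphs, or whole notes are the right unit is a judgment call that depends on how the statements appear in the data.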
No, stripping the inflections will prevent the pre-trained models from working properly, as they assume the input is normal, running text.
You might not need to train your own vectors. Try working with the pre-trained GloVe vectors provided by spaCy first.
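One quick way to judge whether pre-trained vectors are usable on specialized text is to measure how many tokens actually have a vector. The sketch below assumes a spaCy model package that ships word vectors; `en_core_web_md` is used as an example name and is an assumption, as are the helper names `vector_coverage` and `report`.

```python
def vector_coverage(nlp, texts):
    """Return (fraction of alphabetic tokens with a vector, sorted tokens without one).

    `nlp` is expected to be a loaded spaCy pipeline. Only `nlp.pipe`,
    `token.is_alpha`, `token.has_vector` and `token.text` are used.
    """
    total = 0
    covered = 0
    missing = set()
    for doc in nlp.pipe(texts):
        for token in doc:
            if not token.is_alpha:
                continue  # skip punctuation, numbers and similar tokens
            total += 1
            if token.has_vector:
                covered += 1
            else:
                missing.add(token.text)
    fraction = covered / total if total else 0.0
    return fraction, sorted(missing)


def report(texts, model_name="en_core_web_md"):
    """Load a model (assumed to include vectors) and print a coverage summary."""
    import spacy  # imported here so vector_coverage itself has no hard dependency

    nlp = spacy.load(model_name)
    fraction, missing = vector_coverage(nlp, texts)
    print(f"{fraction:.1%} of alphabetic tokens have a vector")
    print("Tokens without a vector:", missing[:50])
```

Calling `report(my_notes)` on a sample of notes gives a rough picture: if most tokens are covered and the missing ones are mainly typos or rare abbreviations, the pre-trained vectors may be a reasonable starting point before training custom ones.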
What did you do to attempt classification? How many examples did you annotate with the categories you're interested in?
Thanks for the video. It was really helpful :). I may look into the GloVe vectors. The challenge is that I am working directly with bio-medical notes.
I looked at how others cleaned up their text (case normalization and the like), and doing the same helped the word vector model.
In terms of what I am trying to do: I am looking for explicit statements of certain health behaviors, such as IV drug use. Believe it or not, it's not standardized in the text, and the grammar of bio-medical notes tends to be poor at best.