Imbalanced classes in a multiclass textcat lead to completely biased predictions

Matthew,

I really appreciate you sharing your NLP expertise. The factors you’ve mentioned had occurred to me and are on my radar, but your specific insight has helped me think about these issues in a new way.

I haven’t tried to address these data and problem-definition issues wholesale because I’m using spaCy and Prodigy to quickly build a proof of concept, and I’m doing it on my own time. I feel that my company could benefit from advanced NLP pipelines and models (an area I’m relatively proficient in), but it’s not my core responsibility, and without a decent working solution it’s hard to convince leadership to allocate my time to these specific problems. I believe spaCy and Prodigy can dramatically reduce the time it takes me to build out a concept, which is why I’ve invested my time and money in learning them. However, the documentation is still a bit raw and the examples a bit sparse. Despite reading everything available and studying all the non-compiled code, the next steps just weren’t as obvious to me as they would be with tools I’m more familiar with.

I’m looking forward to those wrappers for tools like TensorFlow, but in the meantime, I appreciate the tip to tinker with the CNN architecture. Time to start digging into the Thinc documentation.