I just watched Vincent's nice video (we love those - please keep them coming!) on using sentiment models to bootstrap a compliment classification model. I was not aware of that 'Sentimany' repo (nor of any of the other sentiment models), which made me wonder whether there is an overview somewhere of existing classification models - i.e. not just for sentiment, but also for other classes (compliments, insults, threats, hate speech, aggressive language, irony, sarcasm, evidence, hypotheses, etc.) - that could be used in ways similar to those shown in the video. Thanks!
Happy to hear it!
So I trained some of the models in sentimany myself; the full story is that I wrote that repo as part of an exercise explained on my personal blog. If this sounds interesting, you may also enjoy this episode of the Huggingface podcast where I talk more about it.
As far as "general pretrained models" go, I think there are two avenues.
- There are models on huggingface that I re-use. Be aware that it's typically hard to know for sure whether a model has been trained on clean data, or whether the data it was trained on is really relevant for your use case, but the models are available. Also note that many of these are BERT models, which take up a lot of compute.
- You can also try to find relevant datasets and train your own models on those. I usually like to train a simple bag-of-words model in scikit-learn and export it via ONNX for re-use later. That's what sentimany also uses. If you're interested in learning more about that, there's a calmcode course here.
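To make the second avenue concrete, here's a minimal sketch of a bag-of-words classifier in scikit-learn. The tiny dataset and labels are made up for illustration; in practice you'd plug in a labeled dataset for whatever class you care about (compliments, threats, etc.):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 1 = compliment, 0 = not a compliment.
texts = [
    "great job, well done",
    "you are awful",
    "nice work",
    "terrible effort",
    "what a lovely idea",
    "this is horrible",
]
labels = [1, 0, 1, 0, 1, 0]

# Bag-of-words features feeding a linear classifier.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

print(pipe.predict(["well done, lovely work"]))
```

Once fitted, a pipeline like this can be converted for re-use with the skl2onnx package (roughly what sentimany does), so inference doesn't need scikit-learn installed at all.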
As always, be careful not to put all your eggs in the "pretrained models" basket. While they can help you prioritise data to annotate, they are just a proxy. Your dataset is probably unique enough that there will be edge cases these pre-trained models fail to capture.