Recommended approaches for combining NER with text classification

Hi--

Thanks for the great tool:) I have a use case equivalent to the following: I am trying to train a text classifier that classifies whether a length of text evinces positive or negative sentiment with regard to sports teams. Simple enough, but my data often exhibit a single short text in which two teams are mentioned in the same sentence and the sentiment for one is positive while the other is negative, e.g.:

"The Bears are way better than the Spurs!"

What I'd like to be able to do is train a custom NER model to recognize sports teams, and then train a classifier not on the whole text, but on the sentiment associated with each named entity (taking into account the context of the entire text). Do you have any recommendations on approaches that might be useful for this use case? Thank you!

Perhaps you could have the following categories?

  1. Both good
  2. First good, second bad
  3. First bad, second good
  4. Both bad
  5. First neutral, second good
    (etc)

If you frequently have sentences matching three sports teams, you can have categories for that as well --- but I suspect those might be rarer. You might also find that they only occur in situations like "The Antelopes are way better than the Tree Frogs or the Otters", in which case the classification scheme would be something like "First good, others bad".

This approach is likely to require less annotation, and the model will be able to pick up on patterns of contrast, which I think will be much easier than making the decisions individually.
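As a rough sketch of how such joint categories could be mapped back to per-team sentiment after classification: the label names below are made up for illustration, and the teams are assumed to be listed in order of mention.

```python
# Hypothetical joint labels following the "first/second" scheme above.
# The label names are illustrative, not anything built into spaCy or Prodigy.
JOINT_LABELS = {
    "BOTH_GOOD": ("good", "good"),
    "FIRST_GOOD_SECOND_BAD": ("good", "bad"),
    "FIRST_BAD_SECOND_GOOD": ("bad", "good"),
    "BOTH_BAD": ("bad", "bad"),
    "FIRST_NEUTRAL_SECOND_GOOD": ("neutral", "good"),
}


def decode_joint_label(label, teams):
    """Pair each team (in order of mention) with its sentiment."""
    sentiments = JOINT_LABELS[label]
    if len(teams) != len(sentiments):
        raise ValueError(
            f"{label} describes {len(sentiments)} teams, got {len(teams)}"
        )
    return dict(zip(teams, sentiments))


result = decode_joint_label("FIRST_GOOD_SECOND_BAD", ["Bears", "Spurs"])
print(result)
```

The classifier only ever predicts one label per text, and the per-team reading is recovered afterwards from the order in which the teams appear.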

Hi @honnibal thanks for your reply! Sorry for the delay. I've been busy with other projects, and have been thinking about alternative approaches on the basis of your advice. The problem is really that the range of syntactical diversity is a lot greater than the little toy example I've given you. I've done some digging around and think I've found a solution, but it's a little complicated and I was hoping I could get your advice on whether it is implementable within prodigy.

Basically, the problem is that I need to classify not in terms of the whole document, but in terms of how categories are applied to specific words (entities) within the document. I've done some digging into how annotations are stored by prodigy and it seems that whole documents with different matched patterns are stored as separate entries. To stick with the above example (but keeping in mind that the actual examples are much more syntactically diverse), the following would be separate entries in a standard prodigy dataset (apols if I have the spans wrong, I just wrote them by hand):

{"text":"The Bears are way better than the Spurs.",
"spans":[{"text":"Bears","start":4,"end":9,"priority":0.5,"score":0.5,"pattern":10060362}],
"label":"GOOD_TEAM",
"answer":"accept"}

{"text":"The Bears are way better than the Spurs.",
"spans":[{"text":"Spurs","start":34,"end":39,"priority":0.5,"score":0.5,"pattern":12309846}],
"label":"GOOD_TEAM",
"answer":"reject"}
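Since hand-written offsets are easy to get wrong, here is a small sanity check for entries like these. It assumes the offsets are character positions with an exclusive end, so that `text[start:end]` should reproduce the span text. `check_spans` is just a helper name I made up.

```python
def check_spans(task):
    """Return a list of problems found in a task's hand-written spans."""
    problems = []
    text = task["text"]
    for span in task.get("spans", []):
        # Assumption: character offsets, end exclusive.
        covered = text[span["start"]:span["end"]]
        if covered != span["text"]:
            problems.append(
                f"expected {span['text']!r}, but offsets "
                f"{span['start']}-{span['end']} cover {covered!r}"
            )
    return problems


task = {
    "text": "The Bears are way better than the Spurs.",
    "spans": [{"text": "Bears", "start": 4, "end": 9}],
    "label": "GOOD_TEAM",
    "answer": "accept",
}
print(check_spans(task))
```

An empty list means every span lines up with the text it claims to cover.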

My idea is to first train an NER model in prodigy (in this example it would be trained to recognize named sports teams), and then write a custom prodigy textcat.teach-derived recipe that uses my NER model in place of pattern matching. I could then use some kind of context-sensitive vector representation to pass my annotations to a classifier. I've been playing with BERT implementations, and was thinking about starting there. Note that since I want to classify only with reference to the named entities, such an approach would not (could not?) employ active learning. But if you have any recommendations there, I would be happy to hear them.
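To make the idea concrete, here is a minimal sketch of the stream part only: it yields one task per recognized entity, in the same shape as the entries above. It is not a full recipe. `one_task_per_entity` and `fake_nlp` are hypothetical names; `fake_nlp` is a regex stand-in so the snippet runs on its own, and a trained spaCy pipeline could take its place, since its docs expose `ents` with `text`, `start_char` and `end_char`.

```python
import re
from types import SimpleNamespace


def fake_nlp(text):
    """Stand-in for a trained NER pipeline (assumption: only knows two teams)."""
    ents = [
        SimpleNamespace(text=m.group(), start_char=m.start(), end_char=m.end())
        for m in re.finditer(r"Bears|Spurs", text)
    ]
    return SimpleNamespace(ents=ents)


def one_task_per_entity(texts, nlp, label="GOOD_TEAM"):
    """Yield a separate annotation task for every entity found in each text."""
    for text in texts:
        doc = nlp(text)
        for ent in doc.ents:
            yield {
                "text": text,
                "spans": [
                    {"text": ent.text, "start": ent.start_char, "end": ent.end_char}
                ],
                "label": label,
            }


tasks = list(one_task_per_entity(["The Bears are way better than the Spurs."], fake_nlp))
for task in tasks:
    print(task)
```

Each text is repeated once per team, so the accept/reject answer always refers to the single highlighted span.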

My question is, would such an approach be possible using spacy/prodigy models? I see from your docs that your NER model uses context-sensitive vector representations, but I can't seem to find documentation about your classification models. Do they represent text in a context-sensitive manner? If so, are those embeddings sensitive to the information in the spans key of the stream? If not, I would struggle to understand the default behaviour for a dataset exhibiting span-dependent answers, since presumably you'd be training your model to predict opposite outputs from identical inputs :tired_face:.
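One workaround for the identical-inputs problem, whether or not the classifier looks at the spans key, would be to wrap the target entity in marker tokens before classification so that each entry becomes a distinct input. This is a generic entity-marker trick, not something taken from the spaCy or Prodigy docs, and the `[TEAM]` markers and the `mark_entity` helper are made up for illustration.

```python
def mark_entity(text, start, end, open_tag="[TEAM]", close_tag="[/TEAM]"):
    """Wrap text[start:end] in marker tokens (character offsets, end exclusive)."""
    return f"{text[:start]}{open_tag} {text[start:end]} {close_tag}{text[end:]}"


text = "The Bears are way better than the Spurs."
bears_input = mark_entity(text, 4, 9)
spurs_input = mark_entity(text, 34, 39)
print(bears_input)
print(spurs_input)
```

The two marked strings differ, so the accept and reject answers are no longer attached to the same input.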

If this is the case, I suppose I could just use prodigy to annotate and then use a context-sensitive embedding model to classify, and since I would not be taking advantage of active learning in the classification annotating process, it wouldn't be a huge loss. But if such an approach is possible within the prodigy environment, I'd love to know. Thanks so much for all your work! Much appreciated.