Hi @honnibal, thanks for your reply! Sorry for the delay. I've been busy with other projects, and have been thinking about alternative approaches on the basis of your advice. The problem is really that my data is far more syntactically diverse than the little toy example I gave you. I've done some digging around and think I've found a solution, but it's a little complicated, and I was hoping to get your advice on whether it's implementable within Prodigy.
Basically, the problem is that I need to classify not the whole document, but how categories apply to specific words (entities) within it. I've looked into how annotations are stored by Prodigy, and it seems that whole documents with different matched patterns are stored as separate entries. To stick with the example above (keeping in mind that real examples are much more syntactically diverse), the following would be separate entries in a standard Prodigy dataset (I wrote the entries out by hand):
{"text":"The Bears are way better than the Spurs.",
"spans":[{"text":"Bears","start":4,"end":8,"priority":0.5,"score":0.5,"pattern":10060362}],
"label":"GOOD_TEAM",
"answer":"accept"}
{"text":"The Bears are way better than the Spurs.",
"spans":[{"text":"Spurs","start":35,"end":39,"priority":0.5,"score":0.5,"pattern":12309846}],
"label":"GOOD_TEAM",
,"answer":"reject"}
My idea is to first train an NER model in Prodigy (in this example, one trained to recognize named sports teams), and then to write a custom `textcat.teach`-derived recipe which uses my NER model in place of pattern matching (rough sketch below). I could then use some kind of context-sensitive vector representation to pass my annotations to a classifier. I've been playing with BERT implementations and was thinking about starting there. Note that since I want to classify only with reference to the named entities, such an approach would not (could not?) employ active learning, but if you have any recommendations there, I'd be happy to hear them.
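This is roughly what I have in mind for the recipe. It's only a sketch against what I understand of the Prodigy 1.x recipe API, and the recipe name, file paths and default label are placeholders I made up:

```python
# Rough sketch only -- recipe name, paths and the default label are
# placeholders, and I may well have details of the recipe API wrong.
import prodigy
import spacy
from prodigy.components.loaders import JSONL


@prodigy.recipe(
    "textcat.teach-ner",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Path to the trained NER model", "positional", None, str),
    source=("JSONL file with raw texts", "positional", None, str),
    label=("Category to annotate", "option", "l", str),
)
def textcat_teach_ner(dataset, spacy_model, source, label="GOOD_TEAM"):
    nlp = spacy.load(spacy_model)

    def with_entity_spans(examples):
        # Emit one task per predicted entity, so the same text can be
        # accepted for one span and rejected for another
        for eg in examples:
            doc = nlp(eg["text"])
            for ent in doc.ents:
                task = {
                    "text": eg["text"],
                    "spans": [{"start": ent.start_char, "end": ent.end_char,
                               "text": ent.text, "label": ent.label_}],
                    "label": label,
                }
                yield prodigy.set_hashes(task, input_keys=("text",),
                                         task_keys=("spans", "label"))

    return {
        "dataset": dataset,
        "stream": with_entity_spans(JSONL(source)),
        "view_id": "classification",
    }
```

I'd then run it with something like `prodigy textcat.teach-ner my_dataset ./ner_model examples.jsonl -F recipe.py`.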
My question is: would such an approach be possible using spaCy/Prodigy models? I see from your docs that your NER model uses context-sensitive vector representations, but I can't seem to find documentation on your text classification models. Do they represent text in a context-sensitive manner? If so, are those embeddings sensitive to the information in the `spans` key of the stream? If not, I'd struggle to understand the default behaviour for a dataset with span-dependent answers, since presumably you'd be training the model to predict opposite outputs from identical inputs.
If that is the case, I suppose I could just use Prodigy to annotate and then classify with a context-sensitive embedding model outside of Prodigy, along the lines of the sketch below. Since I wouldn't be taking advantage of active learning in the classification annotation process anyway, it wouldn't be a huge loss. But if such an approach is possible within the Prodigy environment, I'd love to know. Thanks so much for all your work! Much appreciated.
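For concreteness, here's a minimal sketch of the span-sensitive classification step, assuming Hugging Face's transformers package; `bert-base-uncased` and mean-pooling over the span's wordpieces are just my first guesses:

```python
# Minimal sketch, assuming Hugging Face's transformers package;
# the model choice and pooling strategy are just first guesses.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


def span_vector(text, start, end):
    """Mean-pool the final-layer vectors of the wordpieces inside [start, end)."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    # Keep wordpieces whose character offsets overlap the entity span;
    # special tokens have (0, 0) offsets and drop out automatically
    mask = [s < end and e > start and e > s for s, e in offsets]
    return hidden[torch.tensor(mask)].mean(dim=0)


# The two annotations above get *different* vectors for the same sentence,
# so a downstream classifier never sees identical inputs with opposite labels
text = "The Bears are way better than the Spurs."
bears = span_vector(text, 4, 9)    # annotated "accept" for GOOD_TEAM
spurs = span_vector(text, 34, 39)  # annotated "reject" for GOOD_TEAM
```

I'd then fit a simple classifier (logistic regression or similar) on those span vectors against my accept/reject annotations.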