Hi! Thank you for making such a great product and providing amazing support. It’s been life saving.
I’m working on adding custom attributes to my model’s tokens so that they can be used as extra features to train the NER. I was looking at spaCy’s documentation and figured out I could add them like this and/or like this.
My question is: will Prodigy take this extra attribute into account when choosing the most relevant samples and updating the model?
No, spaCy's NER model currently uses the NORM, PREFIX, SUFFIX and SHAPE attributes as its features. Custom attributes are not used, because they can contain pretty much any arbitrary information – so spaCy has no way of knowing what is relevant and what isn't, or how the custom attributes relate to the data.
One thing you could do pretty easily, however, is to use custom attributes to influence the selection of relevant examples. By default, ner.teach uses uncertainty sampling, which is implemented via the prefer_uncertain sorter. Sorter functions take a stream of (score, example) tuples and yield a stream of sorted annotation tasks, based on the score. So instead of using the built-in model to score the examples, you can implement your own function that takes custom attributes into account. Here's a simplified example:
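from prodigy.components.sorters import prefer_uncertain

# assumes `nlp` is your loaded spaCy pipeline and `custom_score`
# has been registered as a custom Doc extension
def get_stream(stream):
    for eg in stream:
        doc = nlp(eg['text'])  # process the example text with spaCy
        score = doc._.custom_score  # get a score from your custom attribute
        yield (score, eg)

stream = prefer_uncertain(get_stream(stream))  # sort stream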
Of course, how useful any of this will be depends on what you're trying to do.
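To make the (score, example) flow a bit more concrete, here is a minimal, self-contained sketch that runs without spaCy or Prodigy. Everything in it is an assumption for illustration: custom_score is a hypothetical stand-in for whatever your custom attribute computes (here, the share of capitalised tokens), and naive_prefer_uncertain is a toy sorter that puts scores closest to 0.5 first within each batch. It is not how Prodigy's prefer_uncertain is implemented; it only mimics the idea of preferring uncertain scores.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Example = Dict[str, str]


def custom_score(text: str) -> float:
    """Hypothetical stand-in for doc._.custom_score: share of capitalised tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t[0].isupper() for t in tokens) / len(tokens)


def get_stream(stream: Iterable[Example]) -> Iterator[Tuple[float, Example]]:
    """Attach a score to every example, yielding (score, example) tuples."""
    for eg in stream:
        yield (custom_score(eg["text"]), eg)


def naive_prefer_uncertain(
    scored: Iterable[Tuple[float, Example]], batch_size: int = 3
) -> Iterator[Example]:
    """Toy sorter (an assumption, not Prodigy's implementation):
    within each batch, yield the examples whose score is closest to 0.5 first."""
    batch: List[Tuple[float, Example]] = []
    for item in scored:
        batch.append(item)
        if len(batch) == batch_size:
            for _, eg in sorted(batch, key=lambda pair: abs(pair[0] - 0.5)):
                yield eg
            batch = []
    # flush whatever is left in the final, possibly smaller, batch
    for _, eg in sorted(batch, key=lambda pair: abs(pair[0] - 0.5)):
        yield eg


examples = [
    {"text": "apple pie recipe"},      # no capitalised tokens
    {"text": "Apple hired Tim Cook"},  # three of four tokens capitalised
    {"text": "Berlin is nice"},        # one of three tokens capitalised
]
ordered = list(naive_prefer_uncertain(get_stream(examples)))
print([eg["text"] for eg in ordered])
```

In a real recipe you would keep Prodigy's own sorter and only swap in your scoring function; the toy sorter is here just so the example runs on its own.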
So, if I understood correctly: if I manage to get spaCy’s NER to take this extra feature into account, run the model over the data first and pick the samples with the lowest confidence (instead of relying on prefer_uncertain), and make sure the model’s update function also uses the new feature (whether by default or via a custom update function), I could still use Prodigy?
Yes, exactly. It would even work if your model wasn’t a spaCy / Thinc model but, say, a PyTorch or TensorFlow model. If you’re using Prodigy with a spaCy model, all feature extraction happens in spaCy, so if you get your spaCy model to use your custom features, Prodigy will go along with that. (As far as Prodigy is concerned, it’s simply asking spaCy for a score.)
To implement a custom solution, all you need is two functions like this:
def predict(stream):
    for eg in stream:
        score = YOUR_MODEL.predict(eg['text'])
        yield (score, eg)

def update(examples):
    loss = YOUR_MODEL.update(examples)
    return loss
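As a rough illustration of what could sit behind YOUR_MODEL, here is a small sketch with a toy model plugged into the two functions above. ToyModel, its per-word weights and the 0.1 learning rate are all hypothetical and only there to make the example runnable. It also assumes that annotated examples carry an "answer" key set to "accept" or "reject"; check that against the tasks your recipe actually produces.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


class ToyModel:
    """Hypothetical stand-in for YOUR_MODEL: one weight per lowercased word."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}

    def predict(self, text: str) -> float:
        """Return a score in [0, 1]; 0.5 means the model has no opinion yet."""
        words = text.lower().split()
        if not words:
            return 0.5
        raw = sum(self.weights.get(w, 0.0) for w in words) / len(words)
        return min(1.0, max(0.0, 0.5 + raw))

    def update(self, examples: List[dict]) -> float:
        """Nudge word weights towards the annotations and return the summed squared error."""
        loss = 0.0
        for eg in examples:
            if eg.get("answer") not in ("accept", "reject"):
                continue  # skip anything without a usable decision
            target = 1.0 if eg["answer"] == "accept" else 0.0
            error = target - self.predict(eg["text"])
            loss += error ** 2
            for w in eg["text"].lower().split():
                self.weights[w] = self.weights.get(w, 0.0) + 0.1 * error
        return loss


model = ToyModel()


def predict(stream: Iterable[dict]) -> Iterator[Tuple[float, dict]]:
    for eg in stream:
        yield (model.predict(eg["text"]), eg)


def update(examples: List[dict]) -> float:
    return model.update(examples)


before = model.predict("Apple hired Tim Cook")
loss = update([{"text": "Apple hired Tim Cook", "answer": "accept"}])
after = model.predict("Apple hired Tim Cook")
print(before, loss, after)
```

The point is only the shape of the interface: predict turns a stream of examples into (score, example) tuples, and update takes a batch of annotated examples and returns a loss. What happens inside the model is up to you.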