Does ner.teach take into account attribute extensions?

Hi! Thank you for making such a great product and providing amazing support. It’s been life saving.

I’m working on adding custom attributes to my model’s tokens so that they can be used as extra features to train the NER. I was looking at spaCy’s documentation and figured out I could add them like this and/or like this.

My question is: will Prodigy take this extra attribute into account when choosing the most relevant samples and updating the model?

Thank you, and regards.

No, spaCy’s NER model currently uses the NORM, PREFIX, SUFFIX and SHAPE attributes as its features. Custom attributes are not used, because they can contain pretty much any arbitrary information – so spaCy has no way of knowing what is relevant and what isn’t, or how the custom attributes relate to the data.

You can customise the features of the model, but this will take a little more work. Our video on how spaCy’s NER model works should be a good place to get started. You can also find more details on this in the neural network model architecture section of the docs.

One thing you could do pretty easily, however, is to use custom attributes to influence the selection of relevant examples. By default, ner.teach uses uncertainty sampling, which is implemented via the prefer_uncertain sorter. Sorter functions take a stream of (score, example) tuples and yield a stream of sorted annotation tasks, based on the score. So instead of using the built-in model to score the examples, you can implement your own function that takes custom attributes into account. Here’s a simplified example:

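from prodigy.components.sorters import prefer_uncertain

# Assumes `nlp` is your loaded spaCy pipeline and that `custom_score` has
# already been registered as a Doc extension returning a value between 0 and 1.

def get_stream(stream):
    for eg in stream:
        doc = nlp(eg['text'])  # process the example text with spaCy
        score = doc._.custom_score  # get a score from your custom attribute
        yield (score, eg)

stream = prefer_uncertain(get_stream(stream))  # sort stream by uncertainty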
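
To make the idea of a sorter more concrete, here is a minimal, self-contained sketch of what "preferring uncertain examples" can mean for a stream of (score, example) tuples. This is not Prodigy's implementation, and the built-in prefer_uncertain is more involved. The names uncertainty and sketch_prefer_uncertain, as well as the fixed threshold, are assumptions made up for this illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

Example = Dict[str, str]


def uncertainty(score: float) -> float:
    """Map a score in [0, 1] to an uncertainty in [0, 1].

    A score of 0.5 is treated as most uncertain (1.0),
    a score of 0.0 or 1.0 as least uncertain (0.0).
    """
    return 1.0 - abs(score - 0.5) * 2.0


def sketch_prefer_uncertain(
    scored_stream: Iterable[Tuple[float, Example]],
    threshold: float = 0.5,  # hypothetical cut-off, chosen for illustration
) -> Iterator[Example]:
    """Yield only the examples whose score is close enough to 0.5."""
    for score, eg in scored_stream:
        if uncertainty(score) >= threshold:
            yield eg


scored = [
    (0.05, {"text": "a"}),
    (0.48, {"text": "b"}),
    (0.95, {"text": "c"}),
    (0.60, {"text": "d"}),
]
selected = [eg["text"] for eg in sketch_prefer_uncertain(scored)]
print(selected)
```

Because the function is a generator, it works on streams of any length without loading everything into memory, which is the same reason the scoring function above yields tuples one at a time.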

Of course, how useful any of this will be depends on what you’re trying to do.

Thank you for your answer.

So, if I understood correctly: suppose I manage to get spaCy’s NER to take this extra feature into account, run the model over the data first and choose the samples with the lowest confidence (instead of using prefer_uncertain), and make sure the model’s update function also uses the new feature, whether by default or via a custom update function. Could I still use Prodigy in that case?

Yes, exactly. It would even work if your model wasn’t a spaCy / Thinc model but, say, a PyTorch or TensorFlow model. If you’re using Prodigy with a spaCy model, all feature extraction happens in spaCy, so if you get your spaCy model to use your custom features, Prodigy will go along with that. (As far as Prodigy is concerned, it’s simply asking spaCy for a score.)

To implement a custom solution, all you need is two functions like this:

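def predict(stream):
    # score each incoming example; YOUR_MODEL is a placeholder for your own model
    for eg in stream:
        score = YOUR_MODEL.predict(eg['text'])
        yield (score, eg)

def update(examples):
    # called with a batch of annotated examples; returns the loss
    loss = YOUR_MODEL.update(examples)
    return loss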
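
To show the two functions working end to end, here is a runnable sketch with a toy stand-in for YOUR_MODEL. The ToyModel class and its word-count scoring are entirely hypothetical and only there to make the example executable; the one assumption taken from Prodigy is that annotated examples carry an "answer" key with values like "accept" or "reject".

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Example = Dict[str, str]


class ToyModel:
    """Hypothetical stand-in for YOUR_MODEL: scores text by word counts."""

    def __init__(self) -> None:
        self.accepted: Counter = Counter()
        self.rejected: Counter = Counter()

    def predict(self, text: str) -> float:
        words = text.lower().split()
        pos = sum(self.accepted[w] for w in words)
        neg = sum(self.rejected[w] for w in words)
        # Add-one smoothing, so unseen text gets a score of 0.5
        return (pos + 1) / (pos + neg + 2)

    def update(self, examples: List[Example]) -> float:
        loss = 0.0
        for eg in examples:
            if eg.get("answer") not in ("accept", "reject"):
                continue  # skip ignored or unanswered examples
            target = 1.0 if eg["answer"] == "accept" else 0.0
            # Accumulate the error before learning from the example
            loss += abs(target - self.predict(eg["text"]))
            counts = self.accepted if target == 1.0 else self.rejected
            counts.update(eg["text"].lower().split())
        return loss


MODEL = ToyModel()


def predict(stream: Iterable[Example]) -> Iterator[Tuple[float, Example]]:
    for eg in stream:
        score = MODEL.predict(eg["text"])
        yield (score, eg)


def update(examples: List[Example]) -> float:
    loss = MODEL.update(examples)
    return loss


before = [score for score, _ in predict([{"text": "hello world"}])]
loss = update([{"text": "hello world", "answer": "accept"}])
after = [score for score, _ in predict([{"text": "hello world"}])]
print(before, loss, after)
```

The score for the same text moves away from 0.5 after the update, which is the feedback loop the sorter relies on: as the model learns, different examples become the uncertain ones.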