Using a handmade annotation file for model training

Yes, if you want to train a statistical model to recognise entities in context, you also need to show it examples of them in context. The context window around the tokens is how the entity recognizer decides whether they should be labelled as an entity or not. That’s also why the training data should always be as similar as possible to the input you’re expecting at runtime. If your model only sees single phrases like this, it might learn that “Short phrases like this on their own are an ORG entity”. Similarly, if you train your model on newspaper text, it’ll likely struggle with tweets or legal documents.

For your use case, I’d suggest one of the following options:

1. Use a rule-based approach instead

Machine learning is great if you have a few examples and want to generalise, so your application can find other similar examples in context. But despite the hype, a purely rule-based system can often produce similar or even better results. For an example of this, check out spaCy’s Matcher, which lets you build pretty sophisticated token rules to find phrases in your text (based on the token text, but also other attributes like part-of-speech tags, position in the sentence, surrounding words etc.).
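
To give you an idea, here’s a minimal sketch using the spaCy v3 Matcher API – the model name and example sentence are just placeholders:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# match the two-token sequence "biocarbon amalgamate", case-insensitively
pattern = [{"LOWER": "biocarbon"}, {"LOWER": "amalgamate"}]
matcher.add("ORG_TERMS", [pattern])

doc = nlp("The deal with BioCarbon Amalgamate was announced yesterday.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "BioCarbon Amalgamate"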

2. Use your existing terminology list with a model in the loop

If you do want to find similar terms in context and teach the model about them, you can use your existing examples to create training examples in context (assuming you have a lot of text that contains those terms). The --patterns argument on ner.teach lets you pass in a patterns.jsonl file with entries like this:

{"label": "ORG", "pattern": [{"lower": "biocarbon"}, {"lower": "amalgamate"}]}

The patterns follow the same logic as spaCy’s Matcher. The above example will match a sequence of two tokens whose lowercase forms equal “biocarbon” and “amalgamate” respectively. If Prodigy comes across a match in your data, it will label it ORG and show it to you for annotation. You can then decide whether it’s correct or not. This also lets you handle ambiguous entities and teach your model that it should only label a phrase in certain contexts.

As you click accept and reject, the model in the loop is updated with the pattern matches and eventually starts making suggestions, too, which you can then give it feedback on. You can see an end-to-end workflow like this in action in our Prodigy NER video tutorial.
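
For reference, an invocation could look something like this – the dataset name and file paths are placeholders, and en_core_web_sm is just one possible base model:

prodigy ner.teach your_dataset en_core_web_sm ./your_texts.jsonl --label ORG --patterns ./patterns.jsonl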

3. Create “fake” context examples with templates

In theory, this can also work – you just have to be careful and make sure the templates actually reflect the type of texts you expect to analyse later on. Otherwise, you can easily end up with a model that only works on the artificial examples you came up with yourself. But essentially, you would write a bunch of templates with placeholders, insert your ORG examples at random and then use the results as training data for your model.
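
Just to illustrate the idea, here’s a rough sketch of what generating such examples might look like – the templates and company names are made up, and the output uses spaCy’s character-offset annotation format:

import random

# placeholder templates – replace with phrasing that matches your real texts
templates = [
    "We just signed a new contract with {org}.",
    "{org} reported record earnings this quarter.",
    "The merger between {org} and its main competitor fell through.",
]
orgs = ["BioCarbon Amalgamate", "Acme Corp"]

examples = []
for template in templates:
    org = random.choice(orgs)
    text = template.format(org=org)
    start = text.index(org)
    # annotation as (start, end, label) character offsets
    examples.append((text, {"entities": [(start, start + len(org), "ORG")]}))

print(examples[0])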
