Using a handmade annotation file for model training

Yes, if you want to train a statistical model to recognise entities in context, you also need to show it examples of them in context. The context window around the tokens is how the entity recognizer decides whether they should be labelled as an entity or not. That’s also why the training data should always be as similar as possible to the input you’re expecting at runtime. If your model only sees single phrases like this, it might learn that “Short phrases like this on their own are an ORG entity”. Similarly, if you train your model on newspaper text, it’ll likely struggle with tweets or legal documents.

For your use case, I’d suggest one of the following options:

1. Use a rule-based approach instead

Machine learning is great if you have a few examples and want to generalise, so your application can find other similar examples in context. But despite the hype, a purely rule-based system can often produce similar or even better results. For an example of this, check out spaCy’s Matcher, which lets you build pretty sophisticated token rules to find phrases in your text (based on the token text, but also other attributes like part-of-speech tags, position in the sentence, surrounding words etc.).
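
To give you an idea, here’s a minimal sketch using the spaCy v3 Matcher API – the model name and example sentence are just placeholders:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# match the two-token sequence "biocarbon amalgamate", case-insensitively
pattern = [{"LOWER": "biocarbon"}, {"LOWER": "amalgamate"}]
matcher.add("ORG_TERMS", [pattern])

doc = nlp("The deal with BioCarbon Amalgamate was announced yesterday.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "BioCarbon Amalgamate"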

2. Use your existing terminology list with a model in the loop

If you do want to find similar terms in context and teach the model about them, you can use your existing examples to create training examples in context (assuming you have a lot of text that contains those terms). The --patterns argument on ner.teach lets you pass in a patterns.jsonl file with entries like this:

{"label": "ORG", "pattern": [{"lower": "biocarbon"}, {"lower": "amalgamate"}]}

The patterns follow the same logic as spaCy’s Matcher. The above example will match a sequence of two tokens whose lowercase forms equal “biocarbon” and “amalgamate” respectively. If Prodigy comes across a match in your data, it will label it ORG and show it to you for annotation. You can then decide whether it’s correct or not. This also lets you handle ambiguous entities and teach your model that it should only label a phrase in certain contexts.

As you click accept and reject, the model in the loop is updated with the pattern matches and eventually starts making suggestions, too, which you can then give it feedback on. You can see an end-to-end workflow like this in action in our Prodigy NER video tutorial.
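
For reference, an invocation could look something like this – the dataset name and file paths are placeholders, and en_core_web_sm is just one possible base model:

prodigy ner.teach your_dataset en_core_web_sm ./your_texts.jsonl --label ORG --patterns ./patterns.jsonl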

3. Create “fake” context examples with templates

In theory, this can also work – you just have to be careful and make sure the templates actually reflect the type of texts you expect to analyse later on. Otherwise, you can easily end up with a model that only works on the artificial examples you came up with yourself. But essentially, you would write a bunch of templates with placeholders, insert your ORG examples at random and then use the results as training data for your model.
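
Just to illustrate the idea, here’s a rough sketch of what generating such examples might look like – the templates and company names are made up, and the output uses spaCy’s character-offset annotation format:

import random

# placeholder templates – replace with phrasing that matches your real texts
templates = [
    "We just signed a new contract with {org}.",
    "{org} reported record earnings this quarter.",
    "The merger between {org} and its main competitor fell through.",
]
orgs = ["BioCarbon Amalgamate", "Acme Corp"]

examples = []
for template in templates:
    org = random.choice(orgs)
    text = template.format(org=org)
    start = text.index(org)
    # annotation as (start, end, label) character offsets
    examples.append((text, {"entities": [(start, start + len(org), "ORG")]}))

print(examples[0])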
