Using a handmade annotation file for model training

In addition to using the Prodigy UI to tag / accept / reject entities, I’d like to train my model against a list of known entities I’ve already got – not derived from an annotation process but generated manually. In this case, it’s just a list of companies. So I’m wondering how flexible the annotation file format is. Ideally, I’d like to just supply the annotated entities in a format such as:

{"text": "Biocarbon Amalgamate", "label": "ORG", "score": 1.0, "answer": "accept"}

Based on the annotated sample here: https://prodi.gy/assets/data/reddit_product.jsonl, I can see that there are a number of metadata fields, including the original text and the location of the entity within it. The real question is whether everything in this sample is required, or if I can get away with something similar to my desired format. Thanks.

For training, Prodigy only needs the original input and the actual annotations to train from – for NER, that’s a text and a list of "spans", each with their start and end character offsets and a label. For text classification, that’s a text and a label. And, of course, the "answer", which is either "accept", "reject" or "ignore". Here’s an example of an NER task:

{
    "text": "Biocarbon Amalgamate is an entity",
    "spans": [
        {"start": 0, "end": 20, "label": "ORG"}
    ],
    "answer": "accept"
}

Just make sure that the "text" is always the full text with the entity in context, and that the offsets in the "spans" refer to the character offsets within the text.
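If you're generating these examples programmatically from your company list, a quick sketch of computing the offsets (assuming each company name appears verbatim in the surrounding text) could look like this:

# Hypothetical helper: build NER tasks with character offsets from a list
# of known company names and texts that contain them
companies = ["Biocarbon Amalgamate"]
texts = ["Biocarbon Amalgamate is an entity"]

examples = []
for text in texts:
    spans = []
    for name in companies:
        start = text.find(name)  # naive exact-match lookup; -1 if not found
        if start != -1:
            spans.append({"start": start, "end": start + len(name), "label": "ORG"})
    if spans:
        examples.append({"text": text, "spans": spans, "answer": "accept"})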

For consistency and to prevent potential conflicts, you should also add hashes to the examples you create manually. Those help Prodigy identify duplicates or different annotations on the same input text. The following will add an "_input_hash" and "_task_hash" property to each example:

from prodigy import set_hashes
examples = [...]  # your converted annotations
examples = [set_hashes(eg) for eg in examples]
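Once hashed, you can write the examples out as JSONL (one JSON object per line) and import them into a dataset. A minimal sketch, with a placeholder file name:

import json

# Write the hashed examples to a JSONL file, one JSON object per line
with open("company_annotations.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")

You can then import the file into a dataset with the db-in command, e.g. prodigy db-in your_dataset company_annotations.jsonl.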

That said, if you already have existing data and you want to pre-train your model before adding more annotations with Prodigy, you can also just train the model with spaCy and then load the model into Prodigy. For example:

prodigy ner.teach your_dataset /path/to/spacy-model ...
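If you go down that route, you'll need the data in spaCy's training format. Here's a minimal sketch of the conversion, assuming the (text, annotations) tuple format used by spaCy v2 (newer versions train from a config and Example objects, so adjust accordingly):

# Convert accepted Prodigy-style examples into spaCy v2 training tuples:
# (text, {"entities": [(start, end, label)]})
train_data = []
for eg in examples:
    if eg.get("answer") != "accept":
        continue
    entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
    train_data.append((eg["text"], {"entities": entities}))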

Thanks, Ines – so is the contextual text strictly necessary for training? What I mean is that I have a canonical list of pure examples which I know are ORGs and I’d like to train the classifier on that initially. In that case, could I just do the following?

{
    "text": "Biocarbon Amalgamate", 
    "spans": [
        {"start": 0, "end": 20, "label": "ORG"}
    ],
    "answer": "accept"
}

Yes, if you want to train a statistical model to recognise entities in context, you also need to show it examples of them in context. The context window around the tokens is how the entity recognizer decides whether they should be labelled as an entity or not. That’s also why the training data should always be as similar as possible to the input you’re expecting at runtime. If your model only sees single phrases like this, it might learn that “Short phrases like this on their own are an ORG entity”. Similarly, if you train your model on newspaper text, it’ll likely struggle with tweets or legal documents.

For your use case, I’d suggest one of the following options:

1. Use a rule-based approach instead

Machine learning is great if you have a few examples and want to generalise, so your application can find other, similar examples in context. But despite the hype, a purely rule-based system can often produce similar or even better results. For an example of this, check out spaCy’s Matcher, which lets you build pretty sophisticated token rules to find phrases in your text (based not only on their text, but also on other attributes like part-of-speech tags, position in the sentence, surrounding words etc.).
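For instance, here's a minimal sketch (the model name is just an example, and the matcher.add signature shown is the spaCy v2 one; newer versions take a list of patterns instead):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Match the two-token sequence "biocarbon amalgamate", case-insensitively
pattern = [{"LOWER": "biocarbon"}, {"LOWER": "amalgamate"}]
matcher.add("ORG", None, pattern)

doc = nlp("Biocarbon Amalgamate announced a new partnership.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text, doc.vocab.strings[match_id])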

2. Use your existing terminology list with a model in the loop

If you do want to find similar terms in context and teach the model about them, you can use your existing examples to create training examples in context (assuming you have a lot of text that contains those terms). The --patterns argument on ner.teach lets you pass in a patterns.jsonl file with entries like this:

{"label": "ORG", "pattern": [{"lower": "biocarbon"}, {"lower": "amalgamate"}]}

The patterns follow the same logic as spaCy’s Matcher. The above example will match a sequence of two tokens whose lowercase forms equal “biocarbon” and “amalgamate” respectively. If Prodigy comes across a match in your data, it will label it ORG and show it to you for annotation. You can then decide whether it’s correct or not. This also lets you handle ambiguous entities and teach your model that it should only label a phrase in certain contexts.
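If your company list lives in a plain text file with one name per line, you could generate the patterns file with a small script along these lines (the file names are placeholders, and the naive split() is only an approximation of spaCy's tokenization):

import json

# Hypothetical input file: one company name per line
with open("companies.txt", encoding="utf8") as f:
    names = [line.strip() for line in f if line.strip()]

with open("patterns.jsonl", "w", encoding="utf8") as f:
    for name in names:
        # One token entry per whitespace-separated token, matched on lowercase
        pattern = [{"lower": token.lower()} for token in name.split()]
        f.write(json.dumps({"label": "ORG", "pattern": pattern}) + "\n")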

As you click accept and reject, the model in the loop is updated with the pattern matches and eventually starts making suggestions, too, which you can then give feedback on. You can see an end-to-end workflow like this in action in our Prodigy NER video tutorial.
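Concretely, the command for a session like that might look something like this (the dataset name, model and source file are placeholders):

prodigy ner.teach org_companies en_core_web_sm your_texts.jsonl --label ORG --patterns patterns.jsonl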

3. Create “fake” context examples with templates

In theory, this can also work – you just have to be careful and make sure the templates actually reflect the type of texts you expect to analyse later on. Otherwise, you can easily create a model that’s kinda useless and only works on things you came up with. But essentially, you would write a bunch of templates with placeholders, insert your ORG examples into them at random and use those as training data for your model.
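A rough sketch of that idea (the templates here are invented and should be swapped for ones that resemble the texts you actually expect to process):

import random

# Hypothetical templates with a placeholder for the company name
templates = [
    "{} announced record earnings this quarter.",
    "The merger between {} and a local supplier fell through.",
    "Shares of {} dropped after the report was published.",
]
companies = ["Biocarbon Amalgamate"]

examples = []
for name in companies:
    text = random.choice(templates).format(name)
    start = text.index(name)
    examples.append({
        "text": text,
        "spans": [{"start": start, "end": start + len(name), "label": "ORG"}],
        "answer": "accept",
    })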
