NER overlapping datasets, meaning of lack of annotation

When you run the built-in ner.batch-train, Prodigy will automatically merge all examples on the same input, i.e. the same text (determined by comparing the input hashes of the examples). The "spans" will then be merged together as well.
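Conceptually, the merging step looks something like this. This is a simplified sketch of the idea, not Prodigy's internal implementation; it assumes each example dict carries Prodigy's `"_input_hash"` field and a `"spans"` list:

```python
from collections import defaultdict

def merge_examples(examples):
    """Group examples by input hash (i.e. the same text) and combine
    their entity spans into a single example (sketch only)."""
    by_input = defaultdict(list)
    for eg in examples:
        by_input[eg["_input_hash"]].append(eg)
    merged = []
    for egs in by_input.values():
        combined = dict(egs[0])  # keep the text, tokens, meta of the first copy
        seen, spans = set(), []
        for eg in egs:
            for span in eg.get("spans", []):
                key = (span["start"], span["end"], span["label"])
                if key not in seen:
                    seen.add(key)
                    spans.append(span)
        combined["spans"] = spans
        merged.append(combined)
    return merged
```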

By default, the training process treats unannotated tokens as unknown: if there's no entity annotation for a token, it's considered a missing value rather than an O token (definitely outside an entity). This is what allows training from binary annotations like the ones you collect with ner.teach. (To disable this behaviour and train from gold-standard annotations, where you know that unannotated tokens are definitely not entities, you can set the --no-missing flag btw.)
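For example, here's what the two situations look like in Prodigy's JSON format. The text, labels and command are just made up for illustration:

```python
# Binary annotation as collected with ner.teach: one accepted span.
binary_example = {
    "text": "Uber expanded to London last year.",
    "spans": [{"start": 0, "end": 4, "label": "ORG"}],  # "Uber"
    "answer": "accept",
}
# Default training: "London" and "last year" are simply unknown. The model
# isn't told they're outside an entity, so it isn't penalised for predicting
# GPE or DATE there.

# Gold-standard annotation: every entity in the text is labelled, so it's
# safe to treat all remaining tokens as O, e.g. (illustrative command):
#   prodigy ner.batch-train my_dataset en_core_web_sm --no-missing
gold_example = {
    "text": "Uber expanded to London last year.",
    "spans": [
        {"start": 0, "end": 4, "label": "ORG"},     # "Uber"
        {"start": 17, "end": 23, "label": "GPE"},   # "London"
        {"start": 24, "end": 33, "label": "DATE"},  # "last year"
    ],
}
```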

To update the model with incomplete annotations, Prodigy essentially generates the best possible analysis of the example given the constraints defined by the annotations. If your data includes conflicting spans, those have to be ignored; but if the annotations contain different pieces of information about the example, they can be put together and used to update the weights proportionally, even if we don't know the full truth.
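As a toy illustration of the "conflicting spans have to be ignored" part: two spans that overlap can't both be part of the same analysis, so one of them has to be dropped before the constraints are applied. This is just a sketch of that idea, not Prodigy's actual logic, which happens inside the model's beam search:

```python
def drop_conflicting_spans(spans):
    """Keep a mutually consistent subset of spans by dropping any span that
    overlaps one we've already kept (preferring longer spans, an arbitrary choice)."""
    kept = []
    for span in sorted(spans, key=lambda s: s["end"] - s["start"], reverse=True):
        overlaps = any(span["start"] < k["end"] and k["start"] < span["end"] for k in kept)
        if not overlaps:
            kept.append(span)
    return kept


# Example: a 0-13 GPE span conflicts with a 0-8 GPE span over the same text.
spans = [
    {"start": 0, "end": 8, "label": "GPE"},
    {"start": 0, "end": 13, "label": "GPE"},
    {"start": 20, "end": 26, "label": "ORG"},
]
print(drop_conflicting_spans(spans))
# keeps the 0-13 GPE span and the 20-26 ORG span
```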

My slides here show an example of this process.

If you're performing all those updates while treating unlabelled tokens as missing values, you can still improve accuracy, because the annotations you do have let you prevent the model from predicting a label where you definitely know it doesn't occur. However, if you have gold-standard annotations, you might as well take advantage of that and update the model in a way that treats unlabelled tokens as O.
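In terms of spaCy's BILUO tags, the difference comes down to whether an unannotated token ends up as "O" (definitely not part of an entity) or "-" (unknown). A small illustration, assuming spaCy 2.x, where biluo_tags_from_offsets lives in spacy.gold:

```python
import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy 2.x location

nlp = spacy.blank("en")
doc = nlp("Uber expanded to London last year.")

# Gold-standard annotation: every entity is labelled, so it's safe to mark
# the remaining tokens as "O" (this is what --no-missing assumes).
gold = [(0, 4, "ORG"), (17, 23, "GPE"), (24, 33, "DATE")]
print(biluo_tags_from_offsets(doc, gold))
# ['U-ORG', 'O', 'O', 'U-GPE', 'B-DATE', 'L-DATE', 'O']

# Binary annotation from ner.teach: only "Uber" was annotated, so every other
# token stays unknown ("-") and the model isn't penalised for its predictions there.
binary = ["U-ORG", "-", "-", "-", "-", "-", "-"]
```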

You might want to check out this example of a silver-to-gold workflow btw. It lets you create gold-standard data from silver-standard data (e.g. binary annotations) by generating the best analysis and then correcting it manually if needed.
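If you'd rather roll your own, a rough sketch of that kind of recipe could look like the one below. It's a heavily simplified assumption: it pre-highlights the model's plain predictions for correction in the manual interface, whereas the actual silver-to-gold recipe constrains the analysis with the existing silver spans. The recipe name and arguments are made up for illustration:

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


@prodigy.recipe("ner.silver-to-gold-sketch")
def silver_to_gold_sketch(dataset, source, spacy_model, labels):
    """Pre-highlight the model's analysis of each example so the annotator
    only has to correct it instead of labelling from scratch (sketch only)."""
    nlp = spacy.load(spacy_model)
    labels = labels.split(",")

    def make_tasks(stream):
        for eg in stream:
            doc = nlp(eg["text"])
            # Use the model's predictions as the starting point. The real
            # silver-to-gold workflow would constrain this analysis with the
            # existing (silver) spans instead of predicting from scratch.
            eg["spans"] = [
                {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
                for ent in doc.ents
                if ent.label_ in labels
            ]
            yield eg

    stream = add_tokens(nlp, make_tasks(JSONL(source)))
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": labels},
    }
```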
