NER overlapping datasets, meaning of lack of annotation

eaubin · April 25, 2019, 1:12pm

My NER workflow has been to use ner.teach to create an initial model, then create a gold dataset for each document, export each one with db-out, concatenate all the gold datasets, and batch train a final model.

I initially did this for a set of documents and a few labels. Now I’m adding another label and creating a new dataset for each new label/document pair, reusing the same document set.

How does Prodigy interpret the same sentence appearing twice in the dataset with different labels? Does the lack of an annotation indicate that a token is definitely not part of an entity, or that it is unknown? Do the annotations occurring in the same text need to be merged prior to training?

Suppose I have gold datasets for labels X and Y for documents 1-10, but gold datasets for label Z only for documents 1-3. Am I hurting performance by asserting that there are no occurrences of Z in documents 4-10?

ines · April 25, 2019, 4:34pm

When you run the built-in ner.batch-train, Prodigy will automatically merge all examples on the same input, i.e. the same text (determined by comparing the input hashes of the examples). The "spans" will then be merged together as well.
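To make the merging step concrete, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation: the helper name `merge_by_input_hash` is hypothetical, and the example records only assume the `text`, `spans` and `_input_hash` keys that exported annotations typically carry. Examples that share an input hash are grouped and their spans pooled into one record.

```python
def merge_by_input_hash(examples):
    """Group examples by "_input_hash" and pool their spans (illustrative only)."""
    merged = {}
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            merged[key] = {"text": eg["text"], "_input_hash": key, "spans": []}
        # Track spans already collected so exact duplicates are added once.
        seen = {(s["start"], s["end"], s["label"]) for s in merged[key]["spans"]}
        for span in eg.get("spans", []):
            ident = (span["start"], span["end"], span["label"])
            if ident not in seen:
                merged[key]["spans"].append(dict(span))
                seen.add(ident)
    for eg in merged.values():
        eg["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return list(merged.values())


# Two datasets annotated the same sentence with different labels.
# The hash value 1 is a made-up placeholder.
examples = [
    {"text": "Apple hired Tim Cook", "_input_hash": 1,
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple hired Tim Cook", "_input_hash": 1,
     "spans": [{"start": 12, "end": 20, "label": "PERSON"}]},
]
merged = merge_by_input_hash(examples)
print(merged)
```

So the same sentence appearing twice with different labels ends up as a single training example carrying both spans.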

By default, the training process will assume that all missing values are unknown – so if there's no entity annotation for a token, it's treated as a missing value rather than an O token (definitely outside an entity). This allows training from binary annotations like the ones you collect in ner.teach. (To disable this behaviour and train from gold-standard annotations where you know that unannotated tokens are definitely not entities, you can set the --no-missing flag btw.)
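The difference between the two interpretations can be shown with a small, simplified sketch. The function `token_tags` and its `no_missing` parameter are hypothetical names chosen to mirror the flag above, the `"-"` marker for an unknown token is just the convention used in this sketch, and plain labels are used instead of a full tagging scheme.

```python
def token_tags(tokens, spans, no_missing=False):
    """Return one tag per token.

    tokens: list of (start, end) character offsets
    spans:  list of dicts with "start", "end" and "label"
    Unannotated tokens become "-" (unknown) by default,
    or "O" (definitely outside an entity) when no_missing is True.
    """
    fill = "O" if no_missing else "-"
    tags = [fill] * len(tokens)
    for span in spans:
        for i, (start, end) in enumerate(tokens):
            # A token belongs to a span if it lies fully inside it.
            if start >= span["start"] and end <= span["end"]:
                tags[i] = span["label"]
    return tags


# "Apple hired Tim Cook" split on whitespace, with only ORG annotated.
tokens = [(0, 5), (6, 11), (12, 15), (16, 20)]
spans = [{"start": 0, "end": 5, "label": "ORG"}]

as_missing = token_tags(tokens, spans)                # ['ORG', '-', '-', '-']
as_gold = token_tags(tokens, spans, no_missing=True)  # ['ORG', 'O', 'O', 'O']
print(as_missing, as_gold)
```

In the first case nothing is claimed about "Tim Cook", while in the second the example asserts that it is not an entity.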

To update the model with incomplete annotations, Prodigy essentially generates the best possible analysis of the example given the constraints defined by the annotations. If your data includes conflicting spans, those will have to be ignored. But if the annotations contain different pieces of information about the example, we can put this together and update the weights proportionally, even if we don't know the full truth.
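As a rough illustration of what "conflicting" can mean here, the sketch below separates spans that overlap while disagreeing on boundaries or label from spans that are compatible with each other. This is an assumption about one reasonable way to check your own data before training, not a description of Prodigy's internals, and the helper name `split_conflicts` is hypothetical. It only looks at accepted spans and does not model binary accept/reject decisions.

```python
def split_conflicts(spans):
    """Split spans into (kept, dropped): dropped spans overlap another span
    with different boundaries or a different label."""

    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    def identical(a, b):
        return (a["start"], a["end"], a["label"]) == (b["start"], b["end"], b["label"])

    conflicting = set()
    for i, a in enumerate(spans):
        for j in range(i + 1, len(spans)):
            b = spans[j]
            # Exact duplicates agree with each other, so they are not conflicts.
            if overlaps(a, b) and not identical(a, b):
                conflicting.update((i, j))
    kept = [s for i, s in enumerate(spans) if i not in conflicting]
    dropped = [s for i, s in enumerate(spans) if i in conflicting]
    return kept, dropped


# "Apple hired Tim Cook": one dataset says "Tim Cook" is a PERSON,
# another says "Tim" alone is an ORG.
spans = [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 12, "end": 20, "label": "PERSON"},
    {"start": 12, "end": 15, "label": "ORG"},
]
kept, dropped = split_conflicts(spans)
print(kept)
print(dropped)
```

Running something like this over merged examples is a cheap way to spot annotations that disagree before they reach training.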

My slides here show an example of this process.

If you're performing all those updates while treating unlabelled tokens as missing values, then you might actually improve accuracy because you'd be preventing the model from predicting Z where you definitely know it doesn't occur. However, if you have gold-standard annotations, you might as well take advantage of that and update the model in a way that treats unlabelled tokens as O.

You might want to check out this example of a silver-to-gold workflow btw. It lets you create gold-standard data from silver-standard data (e.g. binary annotations) by generating the best analysis and then correcting it manually if needed.

github.com/explosion/prodigy-recipes/blob/master/ner/ner_silver_to_gold.py

import prodigy
from prodigy.models.ner import EntityRecognizer
from prodigy.components.preprocess import add_tokens
from prodigy.components.db import connect
from prodigy.util import split_string
import spacy
from typing import List, Optional


# Recipe decorator with argument annotations: (description, argument type,
# shortcut, type / converter function called on value before it's passed to
# the function). Descriptions are also shown when typing --help.
@prodigy.recipe(
    "ner.silver-to-gold",
    silver_dataset=("Dataset with binary annotations", "positional", None, str),
    gold_dataset=("Name of dataset to save new annotations", "positional", None, str),
    spacy_model=("The base model", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
)
def ner_silver_to_gold(

This file has been truncated.
