Prodigy collects single spans for NER annotation (labeling one entity at a time). But I think the NER model requires all of the entities in a text to be labeled at training time. Please correct me if I am wrong.
What we noticed is this: say we train on Prodigy data where each “label” is one span per sentence. It works OK when we input a single sentence for prediction, e.g. doc = nlp('I am Obama.'), and Obama is tagged as PERSON. But when we input a paragraph, e.g. doc = nlp('I am Obama. I am Obama.'), the NER logic only seems to tag one of the Obamas, not all of them.
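For reference, this is roughly how we're checking the predictions (the model path is just a placeholder for our Prodigy-trained model):

import spacy

# Load the model we trained from our Prodigy annotations (path is a placeholder).
nlp = spacy.load('/tmp/model-from-prodigy')

doc = nlp('I am Obama. I am Obama.')
print([(ent.text, ent.label_) for ent in doc.ents])
# We expect two PERSON entities here, but only get one.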
A corollary to this is that, since Prodigy annotations are single spans, I've never seen a Prodigy-trained NER model predict more than one entity in a sentence.
Actually, we put a lot of work into relaxing that constraint! Prodigy uses a fairly intricate training strategy to learn from sparse annotations. To learn from binary feedback, we use beam search to get a set of candidate analyses, and then use the annotations to decide which parses in the beam are known-bad. We then perform a global weight update, so that the known-bad parses become less likely and the remaining analyses become more likely.
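To make that concrete, here's a toy sketch of the idea in plain Python. This isn't Prodigy's actual code and the data structures are simplified, but it shows how binary accept/reject decisions can rule beam parses in or out:

# Toy sketch only, not Prodigy's implementation.
# Each beam candidate is a set of (text, label) spans plus a model score.
beam = [
    ({("Obama", "PERSON")}, 0.6),
    ({("Obama", "ORG")}, 0.3),
    (set(), 0.1),
]
accepted = {("Obama", "PERSON")}   # spans the annotator clicked "accept" on
rejected = {("Obama", "ORG")}      # spans the annotator clicked "reject" on

def is_known_bad(spans):
    # A parse is known-bad if it contains a rejected span,
    # or is missing a span the annotator accepted.
    return bool(spans & rejected) or not accepted <= spans

known_bad = [cand for cand in beam if is_known_bad(cand[0])]
remaining = [cand for cand in beam if not is_known_bad(cand[0])]
print(known_bad, remaining)
# The weight update then shifts probability mass away from the
# known-bad parses and onto the remaining candidates.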
Now, if you simply take the binary feedback, convert it into spaCy's format, and train using spaCy's default NER training algorithm, none of this will be true --- because that code does assume the entity annotations are complete. Similarly, if you convert for use in some arbitrary NER model, you'll have the same issue, unless you have an objective that can learn from binary feedback.
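To make the difference concrete, here's a rough sketch of the two kinds of records. The shapes roughly follow spaCy's offset-based training format and Prodigy's task format, but treat the field names as illustrative rather than exact schemas:

# Complete annotation: every entity in the text is labelled, so the
# absence of a span can safely be treated as "not an entity".
complete_example = ("I am Obama. I am Obama.",
                    {"entities": [(5, 10, "PERSON"), (17, 22, "PERSON")]})

# Binary feedback from ner.teach: one candidate span per task, plus an
# accept/reject decision. Nothing is said about the rest of the text.
binary_example = {
    "text": "I am Obama. I am Obama.",
    "spans": [{"start": 5, "end": 10, "label": "PERSON"}],
    "answer": "accept",
}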
The prodigy ner.batch-train command has a --no-missing flag that you can use to tell Prodigy whether you're learning from complete annotations or from the binary feedback you get from ner.teach. After using ner.teach to create binary annotations, there are a number of tools in Prodigy and spaCy to help you manage the distinction:
ner.batch-train lets you specify whether you're working with complete annotations, using the --no-missing flag. This flag defaults to False, i.e. ner.batch-train does not assume the entity annotations are exhaustive.
You can "upgrade" binary feedback by piping the output of ner.print-best into ner.manual and correcting the results. The command sequence would be something like this:
prodigy dataset incomplete-ner
# Create binary feedback, store in dataset incomplete-ner
cat my_data.jsonl | prodigy ner.teach incomplete-ner /tmp/model1
# Learn from your binary annotations
prodigy ner.batch-train incomplete-ner en_vectors_web_lg --output /tmp/model2
# Have a look at what the model thinks is the best parse compatible with your annotations.
# If the best-scoring parse had an entity you clicked "reject" on, or was missing an entity you clicked "accept" on, that parse will be discarded, until we find the best-scoring parse that meets the constraints.
prodigy ner.print-best incomplete-ner /tmp/model2 --pretty | less
# Now we take the 'best' parses from before, correct the mistakes, and save them in a new dataset.
# The parses should already be quite good, making the correction process quicker than it would be
# if you were doing ner.manual from scratch.
prodigy ner.print-best incomplete-ner /tmp/model2 | prodigy ner.manual complete-ner
# Now we can either train a new model:
prodigy ner.batch-train complete-ner en_vectors_web_lg --no-missing
# OR, we can export the data:
prodigy ner.gold-to-spacy complete-ner > /tmp/my-complete-data.jsonl
# I've just pushed spaCy v2.1.0a1, which features a new conversion tool to make training with this output easier. You'll need to install spacy-nightly in a different virtualenv.
spacy convert /tmp/my-complete-data.jsonl > /tmp/my-complete-data.json
# Here I'll assume we actually have two datasets, for training and development.
spacy train en /tmp/final-model /tmp/my-complete-data-train.json /tmp/my-complete-data-dev.json --vectors en_vectors_web_lg --no-parser --no-tagger
I would recommend saving the output into a new dataset, which I've called complete-ner here. It's a little easier to work with the data if you know all the annotations in it are complete; having a mix of complete and incomplete annotations is messier.
So, you use the incomplete-ner dataset as input to ner.print-best, which prints out the best parse in jsonl format. Then you pipe that forward into ner.manual to make the corrections.
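Once you have the final model, a quick sanity check against your original example would be something like this (the output path matches the spacy train command above; the exact subdirectory layout may vary by spaCy version):

import spacy

# spacy train writes model-best / model-final subdirectories under the output path.
nlp = spacy.load('/tmp/final-model/model-best')

doc = nlp('I am Obama. I am Obama.')
print([(ent.text, ent.label_) for ent in doc.ents])
# With complete annotations and --no-missing, both mentions of "Obama"
# should come out as PERSON (given enough training data, of course).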