The ner.teach recipe always asks about a single entity at a time, even though I passed several labels. Does this mean I should accept when an entity is correctly labeled, even though there are other entities in the text segment? How does this affect the goal of having a fully labeled record? Wouldn't the text segments then be stored incompletely in the dataset?
How are you calling the ner.teach command? As the docs mention here, you can pass a --label argument to indicate that there is more than one label you're interested in.
I call the recipe just as it is in the documentation and pass all the labels as well. It says something about ner.teach being BINARY. Maybe that is why only one label is requested per step.
I'm just wondering how to handle it when the data is put into the database: if there are more entities in the text segment, how do I add them? Otherwise it's incomplete.
Just to make sure I understand: it does ask about several labels, but one after the other, not all at once for the whole text segment.
Ah, pardon! I now understand.
The interface is indeed binary, which means that you can either tell it that it is correct or incorrect. This reduces the mental effort required per example, but it does mean that you can only focus on one item at a time.
Prodigy will exhaust all the entities in a sentence before moving on to another one though. So as long as you've wrapped up a sentence you should still be able to train a model on complete sentences. Prodigy is also able to merge all the separate binary labels together before training a spaCy model.
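To make that concrete, here is a simplified sketch in plain Python (not Prodigy's actual internals) of what three accepted binary answers look like and how they can be grouped back into complete sentences. The fields shown (text, spans, answer) follow the documented task format, but real tasks contain more metadata such as hashes and token offsets.
from collections import defaultdict

# Simplified sketch of three accepted binary answers on two sentences.
binary_answers = [
    {"text": "My name is Vincent and I live in Amsterdam.",
     "spans": [{"start": 11, "end": 18, "label": "PERSON"}], "answer": "accept"},
    {"text": "My name is Vincent and I live in Amsterdam.",
     "spans": [{"start": 33, "end": 42, "label": "GPE"}], "answer": "accept"},
    {"text": "But I also visit London sometimes.",
     "spans": [{"start": 17, "end": 23, "label": "GPE"}], "answer": "accept"},
]

# Group the accepted spans per text. Conceptually this is what the merge does:
# one training example per sentence, with all of its accepted entities attached.
merged = defaultdict(list)
for eg in binary_answers:
    if eg["answer"] == "accept":
        merged[eg["text"]].extend(eg["spans"])

for text, spans in merged.items():
    print(text, [(text[s["start"]:s["end"]], s["label"]) for s in spans])
# My name is Vincent and I live in Amsterdam. [('Vincent', 'PERSON'), ('Amsterdam', 'GPE')]
# But I also visit London sometimes. [('London', 'GPE')]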
Demo
You can see for yourself by repeating the following steps. Let's suppose that I have an examples.jsonl file with just this one example:
{"text": "My name is Vincent and I live in Amsterdam. But I also visit London sometimes."}
This contains three entities that are detectable by en_core_web_sm: one name and two places. You can "teach" these examples via:
python -m prodigy ner.teach ner-teach-demo en_core_web_sm examples.jsonl
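If you want to double-check the claim about three entities, a quick snippet like this (not part of the recipe, just a sanity check) shows what en_core_web_sm predicts:
import spacy

# See which entities the base model predicts for the example sentence.
nlp = spacy.load("en_core_web_sm")
doc = nlp("My name is Vincent and I live in Amsterdam. But I also visit London sometimes.")
print([(ent.text, ent.label_) for ent in doc.ents])
# Something like: [('Vincent', 'PERSON'), ('Amsterdam', 'GPE'), ('London', 'GPE')]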
When you now run python -m prodigy db-out ner-teach-demo, you can confirm that there are three annotations. But let's now see what happens when we export this to a .spacy file.
python -m prodigy data-to-spacy ner-teach-demo out_folder
This will have created an out_folder that contains a train.spacy file. This .spacy file contains the data in the binary format that spaCy needs to train its model.
If you'd like to inspect it, you can also read these .spacy files from Python.
import spacy
from spacy.tokens import DocBin
# We need the vocab object of this nlp pipeline later
nlp = spacy.load("en_core_web_sm")
# Load the binary file with labels
docbin = DocBin().from_disk("out_folder/train.spacy")
# Read in the documents.
[doc for doc in docbin.get_docs(nlp.vocab)]
# [My name is Vincent and I live in Amsterdam.,
# But I also visit London sometimes.]
# Check the entities.
[doc.ents for doc in docbin.get_docs(nlp.vocab)]
# [(Vincent, Amsterdam), (London,)]
Note that in the final example, Vincent and Amsterdam are both detected in the same sentence.
This is also what happens internally in Prodigy when you run prodigy train. It will merge the "binary labels" from ner.teach together per sentence before a model is trained on them.
Alternative
If you really prefer an annotation interface that lets you select all the labels at once, you can also use the ner.manual recipe. You can still "guide" the labelling experience by sorting your examples.jsonl file first so that interesting examples appear first. This is a more manual process, but it is also more general.
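For example, a small (hypothetical) pre-processing script along these lines could push sentences containing keywords you care about to the front of the file; the keywords and file names are just placeholders:
import json
from pathlib import Path

# Hypothetical pre-processing step: move sentences that mention certain keywords
# to the top of the file so they show up early in the ner.manual stream.
KEYWORDS = ("Amsterdam", "London")

examples = [json.loads(line)
            for line in Path("examples.jsonl").read_text(encoding="utf8").splitlines()
            if line.strip()]
# Stable sort: examples that contain a keyword come first, original order otherwise.
examples.sort(key=lambda eg: not any(kw in eg["text"] for kw in KEYWORDS))

with Path("examples_sorted.jsonl").open("w", encoding="utf8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")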
I first created a dataset with ner.manual. Then I trained a model on it, which delivers passable results so far (recall and precision ~74%/71%).
After that I continued with ner.teach and added all my 9 labels. After about 2000 annotations the progress stagnates at about 98%. The first 80% is reached quickly, then progress becomes very slow. I stopped annotating and trained a model on the dataset. Here I see very high precision of about 90% but very low recall of about 50%.
If I use this model in ner.correct, far too many entities are marked incorrectly.
Am I doing something wrong here? How can I improve my procedure?
Have you looked at the examples where the model predicts an entity while there isn't one? Does that give any information? My procedure is usually "check out the model errors, and try to label more similar examples".
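One way to do that error analysis outside of Prodigy is a short script like the one below. It's only a sketch under a couple of assumptions: that you have a trained pipeline on disk (the output/model-best path is a placeholder) and a held-out .spacy file with gold annotations, for example the dev.spacy that data-to-spacy writes.
import spacy
from spacy.tokens import DocBin

# Compare the trained pipeline's predictions against held-out gold annotations
# and print the entities it predicts that were never annotated (false positives).
nlp = spacy.load("output/model-best")  # placeholder path to your trained pipeline
gold_docs = list(DocBin().from_disk("out_folder/dev.spacy").get_docs(nlp.vocab))

for gold in gold_docs:
    pred = nlp(gold.text)
    gold_spans = {(e.start_char, e.end_char, e.label_) for e in gold.ents}
    false_positives = [e for e in pred.ents
                       if (e.start_char, e.end_char, e.label_) not in gold_spans]
    if false_positives:
        print(gold.text)
        for ent in false_positives:
            print("  predicted but not annotated:", ent.text, ent.label_)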
When you're using ner.teach, are you sometimes seeing examples with no entities at all? This was a change added in v1.11 to improve accuracy. Before that, ner.teach would only show examples that actually contain an entity, which might cause the model to "believe" there needs to be an entity in every example.
When you're using ner.teach, are you sometimes seeing examples with no entities at all?
Yes, I sometimes see text snippets without entities, but they are malformed. Before the annotation process with Prodigy I scraped a lot of sentences and only inserted sentences containing entities into the database. Maybe I should add sentences without entities.
Some of my sentences are a bit malformed because the text was extracted from PDFs. This is hard to correct, so I skip many of those samples.
Thank you for your answer.
It certainly can't hurt. In general, the best inspiration for which examples to collect more of is to understand which examples the current pipeline gets wrong.
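If you do decide to mix sentences without entities back in, something along these lines might help as a starting point. It's only a sketch with placeholder file names and an arbitrary 20% ratio; you could also use your own trained model instead of en_core_web_sm to decide which sentences look "empty".
import json
import random
import spacy
from pathlib import Path

# When building the source file, keep a share of the sentences in which the
# model predicts no entities at all, instead of filtering them all out.
nlp = spacy.load("en_core_web_sm")
random.seed(0)

with Path("examples_with_negatives.jsonl").open("w", encoding="utf8") as out:
    for line in Path("scraped_sentences.jsonl").read_text(encoding="utf8").splitlines():
        if not line.strip():
            continue
        eg = json.loads(line)
        # Always keep sentences with predicted entities; keep roughly 20% of the rest.
        if nlp(eg["text"]).ents or random.random() < 0.2:
            out.write(json.dumps(eg) + "\n")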