I am annotating a few sentences using Prodigy. There are about 500 sentences in the database. However, Prodigy pops up a message indicating all sentences are done after every few annotations. I cancel the last annotation and redo it, and then it starts showing other annotations in the UI. Is this a bug?
Thanks for the report. Which recipe are you using, and how are you loading in your texts? And did you customise the batch size?
It sounds like for some reason, your queue runs out of examples and the new examples fetched in the background aren’t enough to fill it up in time. One thing you could try is to run the command with PRODIGY_LOGGING=basic. This will output log statements for everything that’s going on behind the scenes, including API requests and the number of tasks that are sent back and forth.
I think I understand why I saw this issue. I changed my tagset from IOB to just entities vs. others, i.e. instead of B-PER, I-PER, B-ORG, I-ORG, O etc., I made it PER, ORG, O etc. (as I did not see much difference in results). I don’t fully understand how this caused the problem, but it vanished when I switched back to IOB notation. So now I have just added an I- to all tags except O, and it does not show that message anymore.
Oh, so your label set included the full IOB tags? Do you have an example of the code you ran? I’m curious to see how this might have impacted the stream and how we could possibly prevent that (or show a better error or warning).
In general, Prodigy will handle the IOB / BILUO mapping for you, including the O label. So if you label a span PER, the included tokens will receive the respective BILUO tags when you train the model. The ner.batch-train recipe also lets you set the --no-missing flag, to explicitly tell Prodigy how to handle untagged tokens. If you set the flag, the annotations are assumed to be gold standard and all unlabelled tokens will be assigned O and treated as not part of an entity. Otherwise, unlabelled tokens will be considered unknown, which obviously has a different effect on the model.
This lets you train from both gold-standard annotations and sparse annotations created using the binary active learning-powered annotation modes. There’s also an ner.gold-to-spacy recipe that lets you convert a Prodigy dataset to spaCy’s training format, with an option to export BILUO tags.
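To make the mapping concrete, here is a minimal sketch in plain Python of how character-offset spans could be turned into per-token BILUO tags, and how a no-missing setting changes the tag given to unlabelled tokens. It is an illustration only, not Prodigy's or spaCy's implementation: the function names, the whitespace tokenization and the use of "-" as the marker for an unknown tag are all assumptions of this sketch.

```python
def whitespace_tokens(text):
    """Return (start, end) character offsets for whitespace-separated tokens."""
    offsets = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    return offsets


def spans_to_biluo(text, spans, no_missing=True):
    """Map character-offset spans to one BILUO tag per token (hypothetical helper)."""
    tokens = whitespace_tokens(text)
    # Gold-standard data: unlabelled tokens are "O" (not an entity).
    # Sparse data: unlabelled tokens are unknown, marked "-" in this sketch.
    tags = ["O" if no_missing else "-"] * len(tokens)
    for span in spans:
        # Indices of all tokens that lie fully inside the span
        idxs = [
            i for i, (start, end) in enumerate(tokens)
            if start >= span["start"] and end <= span["end"]
        ]
        if not idxs:
            continue
        label = span["label"]
        if len(idxs) == 1:
            tags[idxs[0]] = "U-" + label      # single-token entity
        else:
            tags[idxs[0]] = "B-" + label      # first token
            for i in idxs[1:-1]:
                tags[i] = "I-" + label        # tokens in between
            tags[idxs[-1]] = "L-" + label     # last token
    return tags


text = "AL-AIN , United Arab Emirates 1996-12-06"
spans = [
    {"start": 0, "end": 6, "label": "LOC"},
    {"start": 9, "end": 29, "label": "LOC"},
]
gold_tags = spans_to_biluo(text, spans, no_missing=True)
sparse_tags = spans_to_biluo(text, spans, no_missing=False)
```

With no_missing=True the comma and the date are tagged O, while with no_missing=False they are left as unknown, which is the difference between treating a dataset as gold standard and treating it as sparse.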
I did not write any code for this part; I just used Prodigy’s ner.iob-to-gold recipe. I am unable to attach IOB files here, but let me show an example:
File 1:
SOCCER|O -|O JAPAN|B-LOC GET|O LUCKY|O WIN|O ,|O CHINA|B-PER IN|O SURPRISE|O DEFEAT|O .|O
Nadim|B-PER Ladki|I-PER
AL-AIN|B-LOC ,|O United|B-LOC Arab|I-LOC Emirates|I-LOC 1996-12-06|O
File 2:
SOCCER|O -|O JAPAN|LOC GET|O LUCKY|O WIN|O ,|O CHINA|PER IN|O SURPRISE|O DEFEAT|O .|O
Nadim|PER Ladki|PER
AL-AIN|LOC ,|O United|LOC Arab|LOC Emirates|LOC 1996-12-06|O
I converted both of these using the ner.iob-to-gold recipe.
Output for File 1:
{"_input_hash": -376269529, "_task_hash": 1927151313, "no_missing": true, "spans": [{"end": 14, "label": "LOC", "start": 9, "text": "JAPAN"}, {"end": 36, "label": "PER", "start": 31, "text": "CHINA"}], "text": "SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . "}
{"_input_hash": -1517997846, "_task_hash": 806001464, "no_missing": true, "spans": [{"end": 11, "label": "PER", "start": 0, "text": "Nadim Ladki"}], "text": "Nadim Ladki "}
{"_input_hash": -701145227, "_task_hash": 533024259, "no_missing": true, "spans": [{"end": 6, "label": "LOC", "start": 0, "text": "AL-AIN"}, {"end": 29, "label": "LOC", "start": 9, "text": "United Arab Emirates"}], "text": "AL-AIN , United Arab Emirates 1996-12-06 "}
Output for File 2:
{"_input_hash": -376269529, "_task_hash": 367841107, "no_missing": true, "spans": [{"end": 14, "label": "C", "start": 9, "text": "JAPAN"}, {"end": 36, "label": "R", "start": 31, "text": "CHINA"}], "text": "SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . "}
{"_input_hash": -1517997846, "_task_hash": -2079071972, "no_missing": true, "spans": [{"end": 11, "label": "R", "start": 0, "text": "Nadim Ladki"}], "text": "Nadim Ladki "}
{"_input_hash": -701145227, "_task_hash": 1798522814, "no_missing": true, "spans": [{"end": 6, "label": "C", "start": 0, "text": "AL-AIN"}, {"end": 29, "label": "C", "start": 9, "text": "United Arab Emirates"}], "text": "AL-AIN , United Arab Emirates 1996-12-06 "}
In the output for File 2, you can see that the labels come out as “C” for LOC, “R” for PER, etc. With this kind of file, I got the issue described above.
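One possible explanation, which is only a guess and not confirmed anywhere in this thread, is that the conversion drops the first two characters of every non-O tag on the assumption that they are a B- or I- prefix. The sketch below uses a hypothetical naive_label function to show that this would produce exactly the labels seen for File 2, and a hypothetical add_iob_prefix helper that applies the workaround of adding I- to plain tags before conversion.

```python
def naive_label(tag):
    """Hypothetical: strip a two-character prefix without checking that it exists."""
    if tag == "O":
        return tag
    return tag[2:]


def add_iob_prefix(line, sep="|"):
    """Add "I-" to every plain tag in a token|tag line, leaving O and IOB tags alone."""
    fixed = []
    for item in line.split():
        # Split on the last separator so tokens containing "|" stay intact
        token, tag = item.rsplit(sep, 1)
        if tag != "O" and not tag.startswith(("B-", "I-")):
            tag = "I-" + tag
        fixed.append(token + sep + tag)
    return " ".join(fixed)


# Proper IOB tags survive the naive prefix stripping...
iob_labels = [naive_label(t) for t in ["B-LOC", "I-PER", "O"]]
# ...but plain tags lose their first two characters.
plain_labels = [naive_label(t) for t in ["LOC", "PER", "O"]]

fixed_line = add_iob_prefix("AL-AIN|LOC ,|O United|LOC Arab|LOC Emirates|LOC 1996-12-06|O")
```

Here iob_labels keeps LOC and PER intact, while plain_labels contains C and R, matching the File 2 output. Running add_iob_prefix over each line of File 2 gives tags like I-LOC and I-PER, which is the change that made the message go away for me.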