Prodigy asking me to label the same data multiple times

Hello, I'm using Prodigy to train an NER model.

I have imported some old labeled data into a dataset called t0, and I have a file of unlabeled data called data.jsonl.

From there, I do the following:

(In case the code doesn't render properly, here's a paste: https://pastebin.com/DSWC2f0N)


# Train a model with existing data, use active learning to get more annotations

prodigy train ner t0 en_vectors_web_lg --output t0m

prodigy ner.teach t1-silver t0m data.jsonl --label FOO,BAR

prodigy ner.silver-to-gold t1-gold t1-silver t0m --label FOO,BAR

# Merge all annotations together into a new dataset

prodigy db-merge t0,t1-gold t1

# Train a model with existing data, use active learning to get more annotations

prodigy train ner t1 en_vectors_web_lg --output t1m

prodigy ner.teach t2-silver t1m data.jsonl --label FOO,BAR

prodigy ner.silver-to-gold t2-gold t2-silver t1m --label FOO,BAR

# Merge all annotations together into a new dataset

prodigy db-merge t0,t1-gold,t2-gold t2

# Train a model with existing data, use active learning to get more annotations

prodigy train ner t2 en_vectors_web_lg --output t2m

prodigy ner.teach t3-silver t2m data.jsonl --label FOO,BAR

prodigy ner.silver-to-gold t3-gold t3-silver t2m --label FOO,BAR

...

I've noticed that every time I do ner.teach, it seems to go through the dataset in the same order. Specifically, it's repeating a lot of examples I've annotated previously. So I wanted to ask:

  1. How do I stop ner.teach from asking me to label the same data over and over?

  2. When I label the same data again, does it save it as a duplicate data point? If the data point exists in another dataset, how does db-merge deal with the duplicate?

Also wanted to check if my workflow is the intended workflow. Thanks!

Hi! By "same data", do you mean the same text or the exact same suggestions? ner.teach will show you suggestions based on different analyses of the text and ask for binary feedback on them. So you can easily see the same text multiple times, with different model suggestions.

If the same question already has an answer in the current dataset, it's skipped – but if you're using a new dataset, you'll see a suggestion again. For example, when you start annotating again and save to the set t2-silver, there are no existing annotations to exclude. You can use the --exclude option to point to one or more dataset names that should be excluded.
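To make the exclusion idea concrete, here is a minimal Python sketch of filtering a stream against questions that were already answered in other datasets. It is an illustration only, not Prodigy's actual implementation: the `task_hash` and `filter_seen` helpers, the toy examples and the way the hash is computed are all assumptions made for this example. In the workflow above, the practical equivalent is passing the earlier dataset names to --exclude when starting the next ner.teach session.

```python
import hashlib
import json


def task_hash(example):
    """Hypothetical hash over the text plus the suggested spans (assumption)."""
    key = json.dumps(
        {"text": example["text"], "spans": example.get("spans", [])},
        sort_keys=True,
    )
    return hashlib.md5(key.encode("utf8")).hexdigest()


def filter_seen(stream, excluded_datasets):
    """Yield only questions that have no answer in any excluded dataset."""
    seen = {task_hash(eg) for dataset in excluded_datasets for eg in dataset}
    for eg in stream:
        if task_hash(eg) not in seen:
            yield eg


# Toy data standing in for earlier annotations and the incoming stream.
t1_silver = [
    {"text": "Apple is a company", "spans": [{"start": 0, "end": 5, "label": "FOO"}]},
]
stream = [
    # Same text and same suggestion: already answered, so it is filtered out.
    {"text": "Apple is a company", "spans": [{"start": 0, "end": 5, "label": "FOO"}]},
    # Same text but a different suggestion: still a new question.
    {"text": "Apple is a company", "spans": [{"start": 0, "end": 5, "label": "BAR"}]},
    {"text": "Something else"},
]
remaining = list(filter_seen(stream, [t1_silver]))
```

Note that the second item survives the filter even though its text was seen before, which matches the point above that the same text can come up again with a different suggestion.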

Yes, datasets are append-only and can contain multiple annotations of the same data point – either with different suggestions, or exact duplicates. This allows you to store annotations from multiple annotators in the same dataset.

db-merge only concatenates the data and it can't do anything else, because it has no way of knowing what your annotations "mean" and what they refer to. However, when you train a model using prodigy train or export your annotations with prodigy data-to-spacy, all answers referring to the same text (identified by the input and task hashes) will be merged into a single data point to update the model from. This allows you to create multiple datasets, e.g. one dataset for each entity label, and train a joint model with information from all labels.
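As a rough picture of what "merged into a single data point" means, here is a simplified Python sketch that groups accepted answers by their input hash and combines the spans. It is not the code Prodigy runs: `merge_by_input` is a hypothetical helper, the toy datasets are invented, and the real merging logic also has to deal with rejected answers and conflicts, which this sketch ignores. The `_input_hash` and `_task_hash` keys are assumed to be present on each example.

```python
from collections import defaultdict


def merge_by_input(examples):
    """Group accepted answers by input hash and combine their spans."""
    merged = defaultdict(lambda: {"text": None, "spans": []})
    for eg in examples:
        if eg.get("answer") != "accept":
            continue  # this sketch ignores rejected and skipped answers
        entry = merged[eg["_input_hash"]]
        entry["text"] = eg["text"]
        for span in eg.get("spans", []):
            if span not in entry["spans"]:  # drop exact duplicate spans
                entry["spans"].append(span)
    return dict(merged)


# Two toy datasets, one per label, annotating the same text.
foo_set = [
    {"text": "Alice met Bob", "_input_hash": 1, "_task_hash": 10,
     "answer": "accept", "spans": [{"start": 0, "end": 5, "label": "FOO"}]},
]
bar_set = [
    {"text": "Alice met Bob", "_input_hash": 1, "_task_hash": 11,
     "answer": "accept", "spans": [{"start": 10, "end": 13, "label": "BAR"}]},
    # An exact duplicate of the FOO annotation, as db-merge would keep it.
    {"text": "Alice met Bob", "_input_hash": 1, "_task_hash": 10,
     "answer": "accept", "spans": [{"start": 0, "end": 5, "label": "FOO"}]},
]
merged = merge_by_input(foo_set + bar_set)
```

Here the three stored annotations collapse into one entry for the text, carrying one FOO span and one BAR span.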

Hi! I ran into a similar situation when I tried to label more examples by correcting the model's predictions:

python -m prodigy ner.correct relatives_data_correct ./tmp_model ./relatives_sample.json --label RELATIVES --exclude relatives_data

My JSON file with samples contains the following snippet:

{"text": "Oooo...I totally disagree. My thirteen day old (3mo now) was hospitalized because of a virus from his big sister. Obviously we didn't know that it was a virus at the time and he was hospitalized. I'm not sure if you meant to downplay the hospital visit, but until your child has been subject to 24-48 hrs with an IV, continuous antibiotics, a catheter and spinal tap it's hard to say that the hospital and ER visit \"us not the absolute worst.\" Especially when you aren't able to take care of both children because one is hospitalized....\n\nOP, wash hands. Often. Let the older sibling feel important by handing you diapers and other baby items. Let them kiss the baby on the baby's feet and ensure the older sibling is staying out of the baby's face. I feel like these minor things will make a huge difference!", "category": "RELATIVES", "community": "Parenting", "utc": 1448931493}, {"text": "Let's say he isn't yours. \n\nHow would anything change? Would you love him less? Would you care for him less? Would he be less yours? \n\nDidn't think so. Get a test done, if you want, but before you do it, think about what it will(or won't) change. ", "category": "RELATIVES", "community": "Parenting", "utc": 1448931573},{"text": "It really is numbers. Fever is an objective sign. Like a rash or a high white blood cell count. Doesn't matter what the baseline is really. If it's high it's high. \n\nNormal body temp can vary over the course of a day as well. ", "category": "RELATIVES", "community": "Parenting", "utc": 1448931649},{"text": "Well, is the boy in his class cute? \n\nBut seriously tell him you're glad he thinks so and it's also fine to think so. ", "category": "RELATIVES", "community": "Parenting", "utc": 1448931694}

I had the following suggestions for annotation:

  • Oooo...I totally disagree.
  • My thirteen day old (3mo now) was hospitalized because of a virus from his big sister
  • Obviously we didn't know that it was a virus at the time
  • Let's say he isn't yours. \n\n
  • How would anything change?
  • Would you love him less?
  • Would you care for him less?
  • Would he be less yours? \n\n
  • Didn't think so.
  • Get a test done, if you want, but before you do it, think about what it will(or won't) change.
  • It really is numbers.
  • Fever is an objective sign.
  • Like a rash or a high white blood cell count.
  • Doesn't matter what the baseline is really.
  • If it's high it's high. \n\n

And it starts again:

  • Oooo...I totally disagree.
  • My thirteen day old (3mo now) was hospitalized because of a virus from his big sister
  • Obviously we didn't know that it was a virus at the time
  • Let's say he isn't yours. \n\n
    ........

And over and over! I've tried this several times and I can't move forward.

@aleksei What you describe here is a bit different from the original question in this thread. Double-check that you have the latest version of Prodigy and if not, make sure that "feed_overlap" is set to false in your config. Also see this thread and solution here: Resuming annotations after closing the terminal
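If it helps, here is a small Python sketch for setting that option in a JSON config file without overwriting the other settings. The `set_feed_overlap` helper is hypothetical, and the file location is an assumption: point it at whichever prodigy.json your installation actually reads.

```python
import json
from pathlib import Path


def set_feed_overlap(config_path, value=False):
    """Set "feed_overlap" in a JSON config file, keeping other keys intact."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config["feed_overlap"] = value
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config
```

For example, `set_feed_overlap("prodigy.json")` would update a config file in the current directory (the path is only an example).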