Duplicated examples in ner.teach & large .jsonl files

Quick background: we’ve been training several NER models using the following process:

  1. Initial annotation of entities (about 200 examples) using ner.manual
  2. Train model using ner.batch-train
  3. Binary classification of examples using ner.teach (passing in a large .jsonl file as the data source, usually with tens of thousands of examples).

Then, after a few hundred judgments have been made in step 3, we go back to step 2 to retrain before repeating step 3.
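Roughly, the commands for each pass look like this (the dataset, model path, label, and file names below are placeholders for our actual ones):

```
# 1. initial manual annotation
prodigy ner.manual ner_dataset en_core_web_sm windowed_text.jsonl --label MY_LABEL

# 2. train a model from the annotations so far
prodigy ner.batch-train ner_dataset en_core_web_sm --output /tmp/ner-model

# 3. binary annotation of the model's suggestions over the large file
prodigy ner.teach ner_dataset /tmp/ner-model windowed_text.jsonl --label MY_LABEL
```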

Issue 1: I’ve noticed lots of duplicates after retraining the model, where we are asked to relabel an example whose _task_hash and _input_hash are identical to one already in the dataset; we have verified that they are duplicates by inspecting the .jsonl file exported via db-out. Shouldn’t Prodigy avoid showing duplicates? I thought that was the point of the hashes.
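For reference, this is roughly how we checked for duplicated hashes in the export (the dataset name is a placeholder, and jq is just what we happened to use):

```
prodigy db-out ner_dataset > annotations.jsonl
# print _task_hash values that occur more than once
jq '._task_hash' annotations.jsonl | sort | uniq -d
```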

Issue 2: I have also noticed that ner.teach seems to go through examples roughly in the order they appear in the .jsonl file each time we retrain the model and run ner.teach again. Not in precisely linear order, but the first couple dozen examples are typically from the first hundred or so lines of the .jsonl file, the next couple dozen are from roughly the first 200 lines, and so on.

In case it’s relevant: the .jsonl file contains many thousands of lines of windowed text (we do this because the raw files the text comes from are enormous). I’m wondering whether these files are too large to run an active-learning search over, and whether that is what causes the examples to be served in the order we observe.

Thanks!

Hi! Your workflow definitely sounds good – answers below:

Are you using the --exclude argument? By default, Prodigy makes no assumptions about what existing examples in your dataset "mean". But you can tell it to explicitly exclude examples that are present in one or more datasets, which will then use the _task_hash to determine whether an example in the set is identical to an incoming example.
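For example, excluding the dataset you're annotating into would look something like this (dataset, model and label names are placeholders):

```
prodigy ner.teach ner_dataset /tmp/ner-model windowed_text.jsonl --label MY_LABEL --exclude ner_dataset
```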

I guess a possible explanation would be that if you load in the same stream and don't set it to exclude anything, no examples will be skipped and the stream will start at the beginning. But because you're using a different model, the predictions are slightly different, so you'll see slightly different tasks and thus also submit slightly different annotations. Prodigy calculates an exponential moving average to determine the score threshold for whether to show or skip an example.

In general, we always recommend using larger files for the active learning recipes! This usually lets you take better advantage of the sorting and filtering, and you don't have to worry about running out of data. This is especially relevant if you care most about the number and quality of annotations, rather than having every example in your corpus annotated. So the setup you describe shouldn't be a problem.

I had not been using that option before. But I just tried restarting a ner.teach session with "--exclude <dataset_id>", and the problem seems to persist (I'm being shown examples that have already been saved to the dataset and have identical task hashes).

That's strange and very confusing :thinking: You're looking at the _task_hash, right? And are you using the latest version of Prodigy?

OK, so after including the --exclude argument, Prodigy no longer seemed to present any tasks where the _task_hash matched (at least none that I could find).

However, the actual binary task that the Prodigy interface was presenting (both the text and the highlighted span) was exactly equivalent in some cases. The _task_hash was different because I had switched to an NER model trained from a blank English model, rather than one trained from “en_core_web_sm”, and this apparently changes the task hash.
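For what it's worth, here's roughly how I confirmed that tasks with different hashes were identical in content (dataset name is a placeholder; I'm just reducing each task to its text plus span offsets/labels and looking for repeats):

```
prodigy db-out ner_dataset > annotations.jsonl
# keys that appear more than once = tasks that are identical in text and spans
jq -c '{text, spans: [.spans[]? | {start, end, label}]}' annotations.jsonl | sort | uniq -d
```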

Thanks for the update – this is a good point. We should probably exclude that property when we generate the hash. In general, the assumption has been that examples produced by different models should be treated as different – but that's usually reflected in the predictions, tokenization etc. anyway. If the resulting tasks are identical, they can definitely be treated as identical, too.