Quick background: we’ve been training several NER models using the following process:
1. Initial annotation of entities (about 200 examples) using ner.manual.
2. Train a model using ner.batch-train.
3. Binary classification of examples using ner.teach, passing in a large .jsonl file as the data source (usually tens of thousands of examples).
Then, after a few hundred judgments have been made in step 3, we go back to step 2 to retrain, before repeating step 3.
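For concreteness, the loop looks roughly like this; the dataset name, labels, source file, and output path below are placeholders, not our actual values:

```shell
# Step 1: seed annotations by hand (hypothetical dataset/label names)
prodigy ner.manual ner_dataset en_core_web_sm windows.jsonl --label ORG,PRODUCT

# Step 2: train a model from the annotations collected so far
prodigy ner.batch-train ner_dataset en_core_web_sm --output /tmp/ner-model

# Step 3: binary accept/reject judgments with the updated model in the loop
prodigy ner.teach ner_dataset /tmp/ner-model windows.jsonl --label ORG,PRODUCT
```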
Issue 1: I’ve noticed that we’ve been seeing lots of duplicates after retraining the model: we are asked to relabel examples whose task_hash and input_hash are identical to ones we’ve already annotated. We have verified that they are duplicates by inspecting the .jsonl file exported via db-out. Shouldn’t Prodigy be avoiding duplicates? I thought that was the point of the hashes.
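This is how we checked for duplicates, roughly — a minimal sketch that counts repeated (`_input_hash`, `_task_hash`) pairs in a db-out export (the example texts and hash values below are made up):

```python
import json
from collections import Counter

def find_duplicate_tasks(lines):
    """Count (_input_hash, _task_hash) pairs across JSONL lines.

    Pairs that appear more than once are duplicate annotation tasks.
    """
    counts = Counter()
    for line in lines:
        ex = json.loads(line)
        counts[(ex["_input_hash"], ex["_task_hash"])] += 1
    return {pair: n for pair, n in counts.items() if n > 1}

# Hypothetical db-out export: the first two lines share both hashes.
export = [
    '{"text": "Acme Corp opened.", "_input_hash": 111, "_task_hash": 222, "answer": "accept"}',
    '{"text": "Acme Corp opened.", "_input_hash": 111, "_task_hash": 222, "answer": "reject"}',
    '{"text": "Globex filed.", "_input_hash": 333, "_task_hash": 444, "answer": "accept"}',
]
print(find_duplicate_tasks(export))  # {(111, 222): 2}
```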
Issue 2: I have also noticed that ner.teach seems to go through examples roughly in the order they appear in the .jsonl file each time we retrain the model and run ner.teach again. Not in precisely linear order, but the first couple dozen examples are typically from the first hundred or so lines of the .jsonl file, the next couple dozen from around lines 100–200, and so on.
In case it’s relevant: the .jsonl file contains many thousands of lines of windowed text (we window because the raw files the text comes from are enormous). I’m wondering whether these files are too large for an active-learning search, and whether that is what causes the examples to be served in the order we observe.
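For reference, our windowing is essentially the following — a simplified sketch with overlapping fixed-size character windows (the window size, stride, and task format are illustrative; our real preprocessing differs in details):

```python
import json

def window_text(text, size=200, stride=150):
    """Split a long raw text into overlapping character windows,
    returning one Prodigy-style task dict per window."""
    tasks = []
    for start in range(0, max(len(text) - size, 0) + 1, stride):
        chunk = text[start:start + size]
        tasks.append({"text": chunk, "meta": {"start": start}})
    return tasks

# Each task becomes one line of the source .jsonl file.
for task in window_text("a" * 500):
    line = json.dumps(task)
```

With overlap (stride < size), entities near a window boundary still appear whole in at least one window, at the cost of some duplicated text across lines.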