Previously Prodigy would report round numbers for train/eval splits (e.g. 500/500 for a 1000 document dataset). But now there are values like 487/487. Is Prodigy now auto-deduping my data? (This is good, if so, just want to make sure something hasn't gone wrong with my stream coincidentally.)
The previous batch train recipes would sometimes report inaccurate or misleading counts (like counts before deduplicating, merging or filtering out ignored answers). This is also part of what motivated the refactor and the combined and more consistent train recipe.
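To illustrate why the counts can drop from 1000 to something like 974 (and the split from 500/500 to 487/487), here's a minimal sketch of deduplicating by input hash and dropping ignored answers before the split. The field names mirror Prodigy's task format (`_input_hash`, `answer`), but the function itself is a simplified illustration, not Prodigy's actual code:

```python
def filter_and_dedupe(examples):
    # Simplified sketch: drop ignored answers, then keep only the first
    # example for each input hash (i.e. each unique input text).
    seen = set()
    out = []
    for eg in examples:
        if eg.get("answer") == "ignore":
            continue  # ignored answers don't become training examples
        h = eg["_input_hash"]
        if h in seen:
            continue  # duplicate of an input we've already kept
        seen.add(h)
        out.append(eg)
    return out

# 1000 raw annotations: 980 unique accepts, 10 duplicates, 10 ignores
examples = (
    [{"_input_hash": i, "answer": "accept"} for i in range(980)]
    + [{"_input_hash": i, "answer": "accept"} for i in range(10)]
    + [{"_input_hash": 1000 + i, "answer": "ignore"} for i in range(10)]
)
kept = filter_and_dedupe(examples)
print(len(kept))  # 980 examples survive out of 1000
```

So the totals you see in the new recipes reflect the data that's actually used for training, after this kind of cleanup.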
Edit: The following paragraph is true for NER, POS tags and dependencies, but not for text classification (didn't notice the tag on this thread before, sorry!).
Another small change that could have an impact here: if you're not explicitly setting the --binary flag to train from binary accept/reject annotations, rejected answers will be filtered out and won't be included in the total counts.
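A rough sketch of that filtering logic (assumed behavior, simplified, not Prodigy's internals): without binary mode, only accepted answers are kept, so rejects lower the reported totals; with binary mode, both accepts and rejects are used as training signal.

```python
def filter_for_training(examples, binary=False):
    # Simplified sketch of answer filtering before training.
    if binary:
        # Binary mode learns from both accepted and rejected answers.
        return [eg for eg in examples if eg.get("answer") in ("accept", "reject")]
    # Otherwise only accepted answers become training examples.
    return [eg for eg in examples if eg.get("answer") == "accept"]

examples = [{"answer": "accept"}] * 900 + [{"answer": "reject"}] * 100
print(len(filter_for_training(examples)))               # 900
print(len(filter_for_training(examples, binary=True)))  # 1000
```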
If you want to inspect the merged training data created based on your dataset(s), you can also try out the
data-to-spacy recipe, which outputs a JSON file in spaCy's format. It even supports merging annotations of different types – like a text classification dataset and two NER datasets with overlapping annotations. Annotations are merged based on their input hashes. For text-based annotations, this means that Prodigy will find all examples with the same input text and merge them into one with all relevant annotations attached.
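To make the hash-based merging more concrete, here's a simplified sketch of the idea: examples sharing the same `_input_hash` (i.e. the same input text) are combined into one example carrying all relevant annotations. This is an illustration of the concept, not the recipe's actual implementation:

```python
from collections import OrderedDict

def merge_by_input_hash(examples):
    # Simplified sketch: group examples by input hash and combine
    # their NER spans and textcat labels onto one merged example.
    merged = OrderedDict()
    for eg in examples:
        h = eg["_input_hash"]
        if h not in merged:
            merged[h] = {"_input_hash": h, "text": eg["text"],
                         "spans": [], "cats": {}}
        merged[h]["spans"].extend(eg.get("spans", []))
        merged[h]["cats"].update(eg.get("cats", {}))
    return list(merged.values())

# Two NER datasets plus a textcat dataset annotating the same text
ner1 = {"_input_hash": 1, "text": "Apple is based in Cupertino.",
        "spans": [{"start": 0, "end": 5, "label": "ORG"}]}
ner2 = {"_input_hash": 1, "text": "Apple is based in Cupertino.",
        "spans": [{"start": 18, "end": 27, "label": "GPE"}]}
textcat = {"_input_hash": 1, "text": "Apple is based in Cupertino.",
           "cats": {"TECH": 1.0}}

merged = merge_by_input_hash([ner1, ner2, textcat])
print(len(merged))              # 1 merged example
print(len(merged[0]["spans"]))  # 2 entity spans
```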