Does the train recipe in 1.9.0 auto-dedupe?


Previously Prodigy would report round numbers for train/eval splits (e.g. 500/500 for a 1000 document dataset). But now there are values like 487/487. Is Prodigy now auto-deduping my data? (This is good, if so, just want to make sure something hasn't gone wrong with my stream coincidentally.)

The previous batch train recipes would sometimes report inaccurate or misleading counts (like counts taken before deduplicating, merging or filtering out ignored answers). This is also part of what motivated the refactor and the combined, more consistent train recipe.

Edit: The following paragraph is true for NER, POS tags and dependencies, but not for text classification (didn't notice the tag on this thread before, sorry!).
Another small change that could have an impact here: if you're not explicitly setting the --binary flag to train from binary accept/reject annotations, rejected answers will be filtered out and won't be included in the total counts.
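To illustrate why the totals can shrink, here's a minimal sketch of that kind of filtering in plain Python. It's not Prodigy's actual implementation: `filter_answers` is a hypothetical helper, and treating a missing "answer" key as an accept is my assumption. The only thing taken from Prodigy is the "answer" key with the values accept, reject and ignore.

```python
from typing import Any, Dict, List


def filter_answers(examples: List[Dict[str, Any]], binary: bool = False) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only the examples that would count for training."""
    kept = []
    for eg in examples:
        # Assumption: an example without an "answer" key is treated as accepted.
        answer = eg.get("answer", "accept")
        if answer == "ignore":
            # Ignored answers never count, with or without binary mode.
            continue
        if answer == "reject" and not binary:
            # Without binary training, rejected answers are dropped as well.
            continue
        kept.append(eg)
    return kept


examples = [
    {"text": "Apple is a company", "answer": "accept"},
    {"text": "I ate an apple", "answer": "reject"},
    {"text": "asdf", "answer": "ignore"},
]
default_count = len(filter_answers(examples))
binary_count = len(filter_answers(examples, binary=True))
print(default_count, binary_count)
```

So a dataset with 1000 rows can easily end up with fewer than 1000 examples once ignored (and, for non-binary training, rejected) answers are removed.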

If you want to inspect the merged training data created based on your dataset(s), you can also try out the data-to-spacy recipe, which outputs a JSON file in spaCy's format. It even supports merging annotations of different types – like a text classification dataset and two NER datasets with overlapping annotations. Annotations are merged based on their input hashes. For text-based annotations, this means that Prodigy will find all examples with the same input text and merge them into one with all relevant annotations attached.
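As a rough illustration of what merging on input hashes means, here's a small sketch in plain Python. It's a simplification and not the code data-to-spacy runs: `merge_by_input_hash` is a hypothetical helper and the merged output shape (a "cats" dict next to "spans") is my assumption. The "_input_hash", "spans", "label" and "answer" keys are the ones Prodigy stores on annotated examples.

```python
from typing import Any, Dict, List


def merge_by_input_hash(examples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: combine annotations that share the same input hash."""
    merged: Dict[int, Dict[str, Any]] = {}
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            # First time we see this input: start a fresh merged example.
            merged[key] = {"text": eg["text"], "spans": [], "cats": {}}
        target = merged[key]
        for span in eg.get("spans", []):
            # Skip exact duplicates so overlapping datasets don't double up.
            if span not in target["spans"]:
                target["spans"].append(span)
        if "label" in eg:
            # Simplification: a top-level label is treated as a text category.
            target["cats"][eg["label"]] = eg.get("answer") == "accept"
    return list(merged.values())


examples = [
    {"_input_hash": 1, "text": "Apple hired Tim", "answer": "accept",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"_input_hash": 1, "text": "Apple hired Tim", "answer": "accept",
     "spans": [{"start": 12, "end": 15, "label": "PERSON"}]},
    {"_input_hash": 1, "text": "Apple hired Tim", "answer": "accept", "label": "BUSINESS"},
    {"_input_hash": 2, "text": "Nice weather", "answer": "reject", "label": "BUSINESS"},
]
merged_examples = merge_by_input_hash(examples)
print(len(examples), "annotations ->", len(merged_examples), "merged examples")
```

Here four annotations on two distinct texts collapse into two merged examples, which is the same effect that can turn a "round" dataset size into a smaller number after merging.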

Makes sense, thanks!
