ner.correct re-using same source and dataset

srandoux · February 4, 2020, 3:05pm

Hi,

I am trying to create a dataset for NER training. My pipeline is the following: I cold-start by labeling manually with ner.manual, then I train a model from the manual gold data. I would now like to use ner.correct to speed up the manual labeling.

As I am using the same dataset name and source for both manual and correct recipes, I was wondering if this could create duplicates in my final dataset? More generally, is it safe to make a few cycles of train and correct to improve suggestions more and more during annotation?

Thank you

ines · February 4, 2020, 5:17pm

If you're using Prodigy v1.9+, there shouldn't be any problems here: the recipes ner.manual and ner.correct exclude based on the input hash, i.e. the original text only. So two examples with the same text and different pre-highlighted spans will be considered identical and you won't be asked to re-annotate them. You can read more about the hashing and the exclude_by setting in the docs here.
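
As a rough illustration of what excluding by input hash means, here is a minimal sketch in plain Python. It is not Prodigy's actual hashing code: `input_hash` and `filter_seen` are hypothetical helpers, and using MD5 over the text is an assumption made only for the example. The point is that the hash depends on the text alone, so an example with model-suggested spans collides with the same text annotated earlier.

```python
import hashlib


def input_hash(example):
    """Hash only the raw text, ignoring any pre-highlighted spans (illustrative)."""
    return hashlib.md5(example["text"].encode("utf8")).hexdigest()


def filter_seen(stream, seen_hashes):
    """Yield only examples whose input hash is not already in seen_hashes."""
    for eg in stream:
        h = input_hash(eg)
        if h not in seen_hashes:
            seen_hashes.add(h)
            yield eg


# Same text, once annotated by hand and once with model suggestions.
manual = {"text": "Apple opens an office in Berlin.", "spans": []}
suggested = {
    "text": "Apple opens an office in Berlin.",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}
new = {"text": "Google hires in Paris.", "spans": []}

# Pretend the manual example is already in the dataset.
seen = {input_hash(manual)}
remaining = list(filter_seen([suggested, new], seen))
```

In this sketch, `remaining` contains only the example with the new text; the one with suggested spans is skipped because its text was already annotated.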

The only thing to consider is sentence segmentation: if your texts aren't pre-segmented, you probably want to use ner.correct with --unsegmented to ensure that it doesn't try to split your texts into sentences. Otherwise, you could end up with duplicates (sentences vs. full texts).
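
To see why segmentation can lead to duplicates, here is a small sketch under the same assumptions as before (a hypothetical `input_hash` based on MD5 of the text, which is not Prodigy's real implementation). The regex splitter `naive_sentences` is a crude stand-in for a real sentence segmenter. Once a text is split, each sentence has a different hash from the full text, so nothing is recognised as already annotated.

```python
import hashlib
import re


def input_hash(text):
    """Illustrative input hash over the raw text only."""
    return hashlib.md5(text.encode("utf8")).hexdigest()


def naive_sentences(text):
    """Very rough sentence splitter, used only as a stand-in for real segmentation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


full_text = "Apple opens an office in Berlin. Google hires in Paris."

# The full, unsegmented text was annotated in the first (manual) round.
annotated_hashes = {input_hash(full_text)}

# A second round that segments the same source sees sentences instead.
sentences = naive_sentences(full_text)
unseen = [s for s in sentences if input_hash(s) not in annotated_hashes]
```

Here every sentence ends up in `unseen`, even though the full text was already annotated, which is the sentences vs. full texts duplication described above.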

srandoux · February 5, 2020, 8:37am

Thanks a lot, very helpful.
