I am trying to create a dataset for NER training. My pipeline is as follows: I cold-start by labeling manually with ner.manual, then train a model from the manual gold data. I would like to use ner.correct to speed up the remaining manual labeling.
As I am using the same dataset name and source for both the manual and correct recipes, I was wondering: could this create duplicates in my final dataset? More generally, is it safe to run a few train-and-correct cycles to keep improving the suggestions during annotation?
If you're using Prodigy v1.9+, there shouldn't be any problems here: the recipes ner.manual and ner.correct exclude examples based on the input hash, i.e. the original text only. So two examples with the same text but different pre-highlighted spans will be considered identical, and you won't be asked to re-annotate them. You can read more about the hashing and the exclude_by setting in the docs here.
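To illustrate the idea, here's a simplified sketch of input-hash-based exclusion. This is not Prodigy's actual implementation (which uses prodigy.set_hashes internally); it just mimics the behavior that only the raw text, not the spans, determines whether an example counts as already seen:

```python
import hashlib

def input_hash(example):
    # Simplified stand-in for Prodigy's input hash: derived from
    # the raw text only, ignoring any pre-highlighted "spans".
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()

# Same text, once annotated manually and once with model suggestions
manual = {"text": "Apple opened a store in Paris.", "spans": []}
suggested = {
    "text": "Apple opened a store in Paris.",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}

seen = {input_hash(manual)}
# Only the text is hashed, so the model-suggested copy is excluded
is_duplicate = input_hash(suggested) in seen
print(is_duplicate)  # True
```

This is why iterating manual and correct passes over the same source is safe: anything you already annotated is filtered out by its text.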
The only thing to consider is sentence segmentation: if your texts aren't pre-segmented, you probably want to use ner.correct with --unsegmented to ensure that it doesn't try to split your texts into sentences. Otherwise, you could end up with duplicates (sentences vs. full texts).
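The segmentation caveat follows directly from the hashing behavior above. A hypothetical sketch (again using a simplified hash, not Prodigy's real one) shows why a sentence-split copy of a text is not caught by input-hash exclusion:

```python
import hashlib

def input_hash(text):
    # Simplified input hash over the raw text only
    return hashlib.md5(text.encode("utf-8")).hexdigest()

full = "Apple opened a store. It is in Paris."
sentences = ["Apple opened a store.", "It is in Paris."]

# Each segmented sentence hashes differently from the full text,
# so exclusion by input hash will NOT treat them as duplicates.
collides = input_hash(full) in {input_hash(s) for s in sentences}
print(collides)  # False
```

So if one pass annotates full texts and a later pass annotates auto-segmented sentences of the same texts, you'd annotate the content twice; --unsegmented keeps the units consistent across passes.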