When running the command below, Prodigy creates the new dataset 'dataset2', but then serves up tasks from sentences.jsonl for annotation that duplicate examples already saved in 'dataset1', the dataset passed to --exclude.
prodigy ner.correct dataset2 ./tmp_model ./sentences.jsonl --label LABEL --exclude dataset1
After saving 6 annotations to dataset2 with the command above, I verified that these really are duplicate tasks by running the following:
>>> from prodigy.components.db import connect
>>> db = connect()
>>> dataset1 = db.get_dataset("dataset1")
>>> print(len(dataset1))
517
>>> dataset2 = db.get_dataset("dataset2")
>>> print(len(dataset2))
6
>>> task_hashes = {eg["_task_hash"] for eg in dataset1}
>>> print([eg for eg in dataset2 if eg["_task_hash"] not in task_hashes])
[]
>>> input_hashes = {eg["_input_hash"] for eg in dataset1}
>>> print([eg for eg in dataset2 if eg["_input_hash"] not in input_hashes])
[]
>>> texts = {eg["text"] for eg in dataset1}
>>> print([eg for eg in dataset2 if eg["text"] not in texts])
[]
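As a temporary workaround, I'm considering filtering the stream myself before annotating. A minimal sketch (the helper names and the use of md5 as a stand-in for Prodigy's own hashing are mine, not Prodigy's API):

```python
import hashlib

def input_hash(text):
    # Stand-in for Prodigy's input hashing: any stable hash of the raw text works here
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def exclude_seen(stream, seen_texts):
    # Yield only examples whose text has not already been annotated
    seen = {input_hash(t) for t in seen_texts}
    for eg in stream:
        if input_hash(eg["text"]) not in seen:
            yield eg

stream = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
remaining = list(exclude_seen(stream, ["b"]))
# remaining == [{"text": "a"}, {"text": "c"}]
```

In practice I'd build `seen_texts` from the examples returned by `db.get_dataset("dataset1")` and feed the filtered stream back through a custom recipe, but the deduplication logic itself is just this set lookup.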
The problem may be that the --exclude parameter is ignored by the ner.correct recipe. I suspect this because when I pass a nonexistent dataset name like 'fake_dataset_name' to --exclude, as in the example below, the recipe still starts up without complaint.
prodigy ner.correct dataset2 ./tmp_model ./sentences.jsonl --label LABEL --exclude fake_dataset_name
Even if this isn't the cause of the exclude problem, it still seems like a separate issue: I would expect a warning when a dataset name passed to --exclude is not among the datasets in the SQL database.
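For illustration, the kind of up-front check I'd expect a recipe to run (function name and dataset names are hypothetical, not part of Prodigy):

```python
def validate_exclude(exclude_names, existing_datasets):
    # Warn about any --exclude name that isn't present in the database
    existing = set(existing_datasets)
    missing = [name for name in exclude_names if name not in existing]
    for name in missing:
        print(f"Warning: excluded dataset '{name}' does not exist in the database")
    return missing

missing = validate_exclude(["dataset1", "fake_dataset_name"], ["dataset1", "dataset2"])
# missing == ["fake_dataset_name"]
```

A real implementation could get the list of existing datasets from the database connection and raise or warn before the server starts serving tasks.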
Prodigy version: 1.10.3
OS: Windows 10
SQL DB: SQLite