ner.correct --exclude not excluding duplicate tasks


When I run the command below, Prodigy creates the new dataset 'dataset2', but then serves up tasks for annotation from sentences.jsonl that I have already annotated in dataset1, the dataset I passed to --exclude.

prodigy ner.correct dataset2 ./tmp_model ./sentences.jsonl --label LABEL --exclude dataset1

After saving 6 annotations to dataset2 using the command above, I verified that these are indeed duplicates of tasks already in dataset1 by running the code below:

>>> from prodigy.components.db import connect
>>> db = connect()  # connects to the database configured in prodigy.json
>>> dataset1 = db.get_dataset("dataset1")
>>> print(len(dataset1))
517
>>> dataset2 = db.get_dataset("dataset2")
>>> print(len(dataset2))
6
>>> # every example in dataset2 shares its task hash, input hash and text with dataset1
>>> print([eg for eg in dataset2 if eg['_task_hash'] not in {d['_task_hash'] for d in dataset1}])
[]
>>> print([eg for eg in dataset2 if eg['_input_hash'] not in {d['_input_hash'] for d in dataset1}])
[]
>>> print([eg for eg in dataset2 if eg['text'] not in {d['text'] for d in dataset1}])
[]

The problem may be that the --exclude parameter is simply ignored by the ner.correct recipe. I suspect this because when I pass a fake dataset name like 'fake_dataset_name' to --exclude, as in the example below, the recipe still starts up without complaint.

prodigy ner.correct dataset2 ./tmp_model ./sentences.jsonl --label LABEL --exclude fake_dataset_name

Even if that isn't the cause of the exclude problem, it seems like a separate issue in its own right: I would expect a warning that the dataset name I passed to --exclude doesn't exist in the SQL database.
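Something like the check below would catch the typo before the recipe even starts (just a sketch; I'm assuming db.datasets lists the dataset names in the database, and the names here are placeholders):

from prodigy.components.db import connect

db = connect()  # uses the database settings from prodigy.json
exclude = ["dataset1", "fake_dataset_name"]  # names I plan to pass to --exclude

# db.datasets holds the names of all datasets currently in the database
missing = [name for name in exclude if name not in db.datasets]
if missing:
    raise ValueError(f"--exclude dataset(s) not found in the database: {missing}")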

Prodigy version: 1.10.3
OS: Windows 10
SQL DB: SQLite

Hi! As a quick workaround, could you try setting "feed_overlap": false in your prodigy.json?

Ah, it looks like what's happening here is that the Database.get_task_hashes and Database.get_input_hashes methods don't actually check whether the datasets they receive exist; they just return whatever hashes they find for the given names, so a non-existent dataset is silently skipped. I think they might as well raise an error here (at least I can't think of any undesired side-effects).
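Roughly something like this (just a sketch of the idea, not the actual implementation, and the wrapper name is made up):

from prodigy.components.db import connect

def get_task_hashes_strict(db, *names):
    # Fail loudly if any requested dataset doesn't exist, instead of
    # silently returning whatever hashes happen to match the given names
    missing = [name for name in names if name not in db.datasets]
    if missing:
        raise ValueError(f"Can't find dataset(s) in database: {missing}")
    return db.get_task_hashes(*names)

db = connect()
hashes = get_task_hashes_strict(db, "dataset1")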

Hi @ines,
Adding "feed_overlap": false to my prodigy.json fixed the duplicate task problem. Thank you for the workaround.

Your workaround also reminds me that I should have mentioned that I used /?session=dshefman for all of my sessions on both datasets. Is this part of what caused the duplicate tasks? Would it be a good idea for me to avoid using any session names until this is fixed?

If it's easy to do, then yes, that's probably a good idea! In general, I'd recommend only using named sessions if you really need them – otherwise, you're asking Prodigy to compute a bunch of stuff you don't need and it can make the streams a bit harder to reason about.


@ines I seem to be experiencing a related issue, using Prodigy 1.10.4 on OS X (Catalina) with Python 3.8.5. I had the same issue as in the OP, with repeated tasks during ner.correct --exclude, found this thread, and applied the workaround with feed_overlap. That seemed to fix the issue ... at least until I got about 25-30 annotations into my next batch. Then, Prodigy started to repeatedly feed the same tasks back to me, as if it was starting from the beginning of the same 25-30 tasks and looping through again. When I see a task again, the annotations I made on it are gone. The only way I can get it to stop looping through the same set of 25-30 annotations is to kill Prodigy and re-run the same recipe. Then I get a new batch of about 25-30 tasks, and the fun begins again.

When I dump the dataset with db-out, it looks like the annotations I made are being saved, but my confidence is a bit shaken. Two further points:

1) I'm using --unsegmented because I'm correcting with a model I trained on a cold-start dataset (following the general process you put forth in this video). Prodigy was complaining that the model didn't set sentence boundaries, so I came across the --unsegmented option. I'm having the same problem with or without it, though.

2) I'm probably missing something, but the docs say that the hashes _input_hash and _task_hash are supposed to be uint32, no? When I look at my db-out output, a lot of the hashes are negative integers:

... "_input_hash":-705417333,"_task_hash":-803297770 ...

Any ideas?

Thanks for the detailed report! The underlying issue reported in the original post here should have been fixed in v1.10.4, so I wonder if there's something else going on :thinking: But can you confirm that the latest version, without the feed_overlap workaround, still doesn't respect the additional datasets provided via --exclude?

As a quick sanity check, could you try and use a new dataset to save your annotations to? The one other time I've seen a problem similar to this one (same batch being repeated), it was likely related to some interaction with the existing hashes in the current dataset. I haven't seen it come up again since but I'd love to get to the bottom of this.
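If you want to dig into it yourself, a quick diagnostic along these lines (just a sketch – the dataset and file names are placeholders) would show how many incoming tasks already collide with hashes stored in the current dataset:

from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL

db = connect()
existing = db.get_task_hashes("dataset2")  # task hashes already saved in the current dataset

# hash the raw source the same way Prodigy does and count the collisions
stream = (set_hashes(eg) for eg in JSONL("./sentences.jsonl"))
collisions = sum(1 for eg in stream if eg["_task_hash"] in existing)
print(f"{collisions} incoming tasks share a task hash with the current dataset")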

Ohh, thanks for pointing this out, this is a mistake in the docs! This used to be true in early versions but we've since adjusted the hashing logic. I'll change that to just say integer. (If a user wants to implement custom hashing, any integer will do.)
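For example, something like this (a quick sketch – the exact values will differ on your machine):

from prodigy import set_hashes

eg = set_hashes({"text": "Some example sentence."})
print(eg["_input_hash"], eg["_task_hash"])
# both are plain signed integers, so negative values like the ones
# in your db-out output are expected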

Sorry for the delay in responding, long week. :slight_smile:

I had the repetition issue with and without feed_overlap in the first dataset I was working with under 1.10.4. However, I've just tested it this morning with a new dataset and new source texts, and feed_overlap now has a different effect: when "feed_overlap": false was present in prodigy.json, the looping problem after about 25 annotations was present, but when I took it out, it went away.

However, even though removing feed_overlap makes the looping problem go away, the --exclude problem from the original post is now back.

If there's anything else I can do to help you track this down, please do let me know.