ner.correct --exclude not excluding duplicate tasks

When using the command below, Prodigy creates a new dataset 'dataset2', but then serves up repeated tasks for annotation from sentences.jsonl.

prodigy ner.correct dataset2 ./tmp_model ./sentences.jsonl --label LABEL --exclude dataset1

After saving 6 annotations to dataset2 using the command above, I verified that these are truly duplicate tasks by running the code below:

>>> from prodigy.components import db
>>> dataset1 = db.get_dataset("dataset1")
>>> print(len(dataset1))
517
>>> dataset2 = db.get_dataset("dataset2")
>>> print(len(dataset2))
6
>>> print([eg for eg in dataset2 if eg['_task_hash'] not in {eg['_task_hash'] for eg in dataset1}])
[]
>>> print([eg for eg in dataset2 if eg['_input_hash'] not in {eg['_input_hash'] for eg in dataset1}])
[]
>>> print([eg for eg in dataset2 if eg['text'] not in {eg['text'] for eg in dataset1}])
[]

The problem may be that the --exclude parameter is ignored by the ner.correct recipe. I suspect this because when I pass a nonexistent dataset name like 'fake_dataset_name' to --exclude, as in the example below, the recipe still starts up without a problem.

prodigy ner.correct dataset2 ./tmp_model ./sentences.jsonl --label LABEL --exclude fake_dataset_name

Even if this isn't the cause of the exclude problem, it still seems like a separate problem. I would expect a warning that the dataset name I passed to the --exclude parameter does not exist in the database.

Prodigy version: 1.10.3
OS: Windows 10
SQL DB: SQLite

Hi! As a quick workaround, could you try setting "feed_overlap": false in your prodigy.json?

Ah, it looks like what's happening here is that the Database.get_task_hashes and Database.get_input_hashes methods don't actually check whether the datasets they receive exist and just return all hashes that are in datasets of the given names. I think they might as well raise an error here (at least I can't think of any undesired side-effects here).
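
A minimal sketch of the kind of existence check described above, in plain Python rather than Prodigy's actual implementation. The helper name find_missing_datasets is hypothetical, and the list of existing names is assumed to come from the database (for example from the datasets property of the connected database object, if your version exposes it).

```python
from typing import Iterable, List


def find_missing_datasets(requested: Iterable[str], existing: Iterable[str]) -> List[str]:
    """Return the requested dataset names that are not known to the database, in order."""
    known = set(existing)
    return [name for name in requested if name not in known]


# Hypothetical usage: names passed to --exclude vs. names the database knows about.
missing = find_missing_datasets(
    ["dataset1", "fake_dataset_name"],  # values given to --exclude
    ["dataset1", "dataset2"],           # assumed list of existing datasets
)
if missing:
    print(f"Warning: unknown datasets passed to --exclude: {missing}")
```

A recipe could raise an error instead of printing a warning; which is preferable depends on whether excluding a not-yet-created dataset should ever be allowed.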

Hi @ines,
Adding "feed_overlap": false to my prodigy.json fixed the duplicate task problem. Thank you for the workaround.

Your workaround also reminds me that I should have mentioned that I used /?session=dshefman for all of my sessions on both datasets. Is this part of what caused the duplicate tasks? Would it be a good idea for me to avoid using any session names until this is fixed?

If it's easy to do, then yes, that's probably a good idea! In general I'd recommend to only use named sessions if you really need them – otherwise, you're asking Prodigy to compute a bunch of stuff you don't need and it can make the streams a bit harder to reason about.

@ines I seem to be experiencing a related issue, using Prodigy 1.10.4 on OS X (Catalina) with Python 3.8.5. I had the same issue as in the OP, with repeated tasks during ner.correct --exclude, found this thread, and applied the workaround with feed_overlap. That seemed to fix the issue ... at least until I got about 25-30 annotations into my next batch. Then, Prodigy started to repeatedly feed the same tasks back to me, as if it was starting from the beginning of the same 25 tasks and looping through again. When I see a task again, the annotations I made are gone. The only way I can get it to stop looping through the same set of 25-30 tasks is to kill Prodigy and re-run the same recipe. Then, I get a new batch of about 25-30 tasks, and the fun begins again.

When I dump the dataset with db-out, it looks like the annotations I made are being saved, but my confidence is a bit shaken. Two further points:

1). I'm using --unsegmented because I'm correcting with a model I trained on a cold start dataset (following the general process you put forth in this video). However, it was complaining that the model didn't set sentence boundaries, so I came upon the --unsegmented option. I am having the same problem with or without it, though.

2). I'm probably missing something, but the docs say that the hashes _input_hash and _task_hash are supposed to be uint32, no? When I look at my db-out output, a lot of the hashes seem to be negative integers:

... "_input_hash":-705417333,"_task_hash":-803297770 ...

Any ideas?

Thanks for the detailed report! The underlying issue reported in the original post here should have been fixed in v1.10.4, so I wonder if there's something else going on :thinking: But you can definitely confirm that using the latest version without the feed_overlap workaround doesn't respect the additional datasets provided via --exclude?

As a quick sanity check, could you try and use a new dataset to save your annotations to? The one other time I've seen a problem similar to this one (same batch being repeated), it was likely related to some interaction with the existing hashes in the current dataset. I haven't seen it come up again since but I'd love to get to the bottom of this.

Ohh, thanks for pointing this out, this is a mistake in the docs! This used to be true in early versions but we've since adjusted the hashing logic. I'll change that to just say integer. (If a user wants to implement custom hashing, any integer will do.)
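
For readers comparing old and new exports, here is a small sketch of how a negative value relates to a uint32. It assumes (this is not confirmed in this thread) that the hashes are 32-bit values interpreted as signed integers; the two helper names are hypothetical.

```python
def to_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as an unsigned one (keep the low 32 bits)."""
    return value & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of an integer as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


# The hashes quoted earlier in the thread, viewed as unsigned 32-bit values.
print(to_unsigned32(-705417333))  # 3589549963
print(to_unsigned32(-803297770))  # 3491669526
```

Either representation works as an identifier as long as the same one is used consistently when comparing hashes.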

Sorry for the delay in responding, long week. :slight_smile:

I had the repetition issue with and without feed_overlap in the first dataset I was working with under 1.10.4; however, I've just tested it this morning with a new dataset and new source texts, and feed_overlap now has a different effect: when feed_overlap: "false" is present in prodigy.json, the looping problem after 25 annotations was present, but when I took it out, it went away.

However, though the looping problem has gone away after removing feed_overlap, the OP's problem with --exclude is now present again.

If there's anything else I can do to help you track this down, please do let me know.

Hi, I found this post because I'm encountering the same issue with the latest Prodigy version (1.10.4) on Windows 10. I saved 260 annotations with ner.manual and wanted to continue with ner.correct, excluding the texts I had already annotated. However, it sends me duplicate texts, starting from the beginning of my input JSONL file.

Then I wanted to try the workaround and edit the prodigy.json file, but I could not find it. Strangely, Prodigy has worked so far without a prodigy.json file. I created a virtual env when installing Prodigy. Should I add prodigy.json manually to the location where Prodigy is installed?

Second question: can I just run ner.manual with the same dataset name I created last time to continue annotating until I find a solution to the duplicated input with ner.correct?

Thank you very much!

Prodigy will create a prodigy.json in your user home in a directory .prodigy if it doesn't yet exist. You can run prodigy stats to find the exact path. Alternatively, you can override any settings in your global config by putting a prodigy.json in your current working directory.

Hi @xia,

Sorry to hear you're experiencing duplicates. The default configuration right now comes set up for multiple named-sessions, and if you try to annotate without them you can experience duplicates after stopping and starting prodigy. This is because when you don't use a session name and the option is enabled, it uses a default name based on the current timestamp when you start the server. This leads to restarting the server changing the default session name, and you see the examples again.

An easy way to find out if this is your problem would be to open your Prodigy app using a named session (e.g. http://localhost:8080/?session=my_name), annotate a few examples, then restart the server and refresh the webpage. If this fixes your problem and you don't want to use named sessions, you can set "feed_overlap": false in your prodigy.json.

I usually create a prodigy.json file in the working directory where my prodigy project is. Here's a simple example of one that turns off feed_overlap:

{
  "feed_overlap": false
}
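
If you already have a prodigy.json with other settings, a small script can set this one key without touching the rest. This is only a sketch: the helper name set_feed_overlap is hypothetical, and it assumes the file is plain JSON located in the directory you pass in.

```python
import json
from pathlib import Path


def set_feed_overlap(directory: str, enabled: bool = False) -> Path:
    """Create or update prodigy.json in `directory`, preserving any other settings."""
    path = Path(directory) / "prodigy.json"
    # Start from the existing config if there is one, otherwise from an empty dict.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    config["feed_overlap"] = enabled
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return path
```

Calling set_feed_overlap(".") from the project's working directory would write the local override described above.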

Hopefully this helps,
-Justin

Hi!

I have the same issue in the second round of my ner.correct annotations when I exclude the previous two datasets.

Here are the steps I followed :

prodigy ner.manual dataset1  en_core_web_lg source_file.jsonl --label my_label 
prodigy train ner dataset1 en_core_web_lg --output ./model1 --eval-split 0.2 --n-iter 10
prodigy ner.correct dataset2  ./model1 source_file.jsonl --label my_label --exclude dataset1
prodigy train ner dataset1,dataset2 en_core_web_lg --output ./model2 --eval-split 0.2 --n-iter 20

So far everything works as expected, but once I do the second round of annotation with ner.correct, I get only 25 examples repeated :

prodigy ner.correct dataset3  ./model2 source_file.jsonl --label my_label --exclude dataset1,dataset2

I replicated the last step multiple times, each time with a separate dataset and session and always got the same results.

Hello, I'm experiencing this same issue using 1.10.5 (OS X, Python 3.8.6). I've tried all the different workarounds mentioned (using a different destination dataset, feed_overlap: false, feed_overlap: true, sessions), but with every combination the initial examples come back after I tag example number 26.

The only workaround I've found is to save, restart the server, and do another batch until I see a duplicate.

I've followed the instructions https://www.youtube.com/watch?v=59BKHO_xBPA word by word with the same result

I tried the same commands with the same datasets (exported as csv files) in a new machine and everything worked without a problem.
However, I couldn't figure out what caused the issue in the old machine.

I am also having this problem following pretty much the steps in the video.

At first I used ner.manual on my data set and patterns, then after enough tagging I generated a tmp_model with the train recipe. I then wanted to continue annotating, drop the patterns, and use the new model to help instead. But when I start ner.correct with a new data set, the same input data, and the old data set excluded, I first get exactly 25 items, then the 26th one wraps around to the first item again, and I'm stuck in a loop.

I seem to be able to get out of the loop by stopping Prodigy, and restarting, but I'm not sure what's going on or if it's safe to keep going -- I guess I can run a script after to make sure there aren't any duplicates. But I keep having to stop and restart every 25 items.
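
A check along those lines could look like the sketch below. It assumes the annotations were exported to JSONL (for example with db-out) and that each record carries the _task_hash and _input_hash keys shown earlier in the thread; the function name and the file name in the comment are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def count_duplicate_hashes(lines: Iterable[str], key: str = "_task_hash") -> Dict[int, int]:
    """Map each hash that occurs more than once to its number of occurrences."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        counts[json.loads(line)[key]] += 1
    return {hash_value: n for hash_value, n in counts.items() if n > 1}


# Hypothetical usage on an exported file:
# with open("ner_items_2.jsonl", encoding="utf-8") as f:
#     print(count_duplicate_hashes(f, key="_input_hash"))
```

An empty result means no hash appears twice in the export for the chosen key.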

{
"theme": "basic",
"custom_theme": {},
"buttons": ["accept", "reject", "ignore", "undo"],
"batch_size": 10,
"history_size": 10,
"port": 8080,
"host": "localhost",
"cors": true,
"db": "sqlite",
"db_settings": {},
"api_keys": {},
"validate": true,
"auto_exclude_current": true,
"instant_submit": false,
"feed_overlap": false,
"ui_lang": "en",
"project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
"show_stats": false,
"hide_meta": false,
"show_flag": false,
"instructions": "./instructions.html",
"swipe": true,
"swipe_gestures": { "left": "accept", "right": "reject" },
"split_sents_threshold": false,
"html_template": false,
"global_css": null,
"javascript": null,
"writing_dir": "ltr",
"show_whitespace": false,
"exclude_by": "input"
}

First recipe I ran:

python -m prodigy ner.manual ner_items blank:en ./output/corpus.jsonl --label ORG --patterns ./output/patterns.jsonl

Then I ran:

python -m prodigy train ner ner_items blank:en --output ./tmp_model --eval-split .2

Now I want to take the patterns off and use this model to keep tagging:

python -m prodigy ner.correct ner_items_2 ./tmp_model ./output/corpus.jsonl --label ORG --exclude ner_items

Looking in the database, it looks like no duplicates get saved for the second data set (ner_items_2). I noticed that each item I tag is linked (via the link mapping table) to the data set I named and to what looks like a data set named after the date I started Prodigy (the session). Not sure if it's normal to keep each session separate. There are, however, duplicates for my first data set, but I think that's because I deleted my first data set after about 50 annotations to change the labels I use, and then re-ran with the same input under a new data set name. It looks like the examples are still in the examples table, but there is only a single link back to the data set for each of the duplicate items in the examples table.

Edit: Exporting my two data sets, removing the old DB, then importing the annotations again into the same data set, then running annotation again seems to have fixed it!

Thanks for the update, this is super helpful! I mentioned before that I experienced something like this once and never managed to track it down, so having it confirmed again is really useful. Do you still have your old DB by any chance? We'd love to take a quick look at it – the most interesting parts would be the links table and the hashes, not your actual texts or spans, so you can remove those.

Also, how old was your database, i.e. when did you first install and set up Prodigy? (This could help us figure out whether it might have been caused by a change in how Prodigy accesses the hashes in newer versions, or whether it's just a general issue of the database / links table ending up in a weird state.)
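
For anyone who wants to share only the links and hashes, a sketch like the following could pull them from a SQLite file without the texts. The table and column names (dataset, example, link, input_hash, task_hash, example_id, dataset_id) are assumptions about the schema and should be checked against your own database before relying on the output.

```python
import sqlite3
from typing import List, Tuple


def dump_hashes_and_links(conn: sqlite3.Connection) -> List[Tuple[str, int, int, int]]:
    """Return (dataset name, example id, input hash, task hash) rows, without texts."""
    # Assumed schema: `link` joins `dataset` and `example` via dataset_id / example_id.
    query = """
        SELECT dataset.name, example.id, example.input_hash, example.task_hash
        FROM link
        JOIN dataset ON dataset.id = link.dataset_id
        JOIN example ON example.id = link.example_id
        ORDER BY dataset.name, example.id
    """
    return conn.execute(query).fetchall()


# Hypothetical usage (the path is a placeholder for your own SQLite file):
# conn = sqlite3.connect("path/to/prodigy.db")
# for row in dump_hashes_and_links(conn):
#     print(row)
```

Because the query never selects the content column, the annotated texts and spans stay out of the result.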

Hi ! I am having the same problem as mentioned by the threads here. After running ner.correct I was super impressed at how it was labeling the unlabeled texts until I found an error and corrected it. It was then that I started experiencing the infinite loop. I thought it was the correction that started the loop but maybe in fact it was 25 texts in - as others seem to have experienced.
Is there any update on this - I do see someone had mentioned a workaround by exporting and re-importing but I wanted to check if that was the recommended approach - and perhaps you could restate that again - it wasn't entirely clear to me how to replicate. Thank you

Hey @ines I've been experiencing a similar issue to those described in this thread using prodigy 1.11.6 with spans.manual. In my case, I was, over multiple sessions, getting a duplicate every 34 slides (kind of weird! but always 34) with a dataset that had text that also appeared in other datasets (I don't recall experiencing the issue otherwise?). I removed some datasets featuring the same text values, and the problem did not persist for me today; I did not remove any datasets in another database instance, and the problem is persisting for another annotator. I can pull the example (sans the content column), dataset, and link tables from both prodigy db if it'd be helpful for you to see those.

Hello,

I have been seeing this bug too with the latest version of Prodigy when running ner.correct. I was using named sessions without "feed_overlap": false. I have now added it to my config and will report back if it keeps happening.

Edit :
Looks like we are still getting some duplicates from time to time. This occurs in the same session.

Edit 2 : Now using prodigy without the named sessions and it is still happening.

From "prodigy progress"

           New   Unique   Total   Unique
--------   ---   ------   -----   ------
Dec 2021   122      109     122      109

Start command :

prodigy ner.correct testdataset ./model_7500/model-best ./shuffled_data.json --label NAME,PHONE

And my prodigy.json file :

{
    "host": "0.0.0.0",
    "port": 8081,
    "show_stats": true,
    "show_flag": true,
    "ui_lang": "fr",
    "feed_overlap": false,
    "custom_theme": {
        "labels": {
            "NAME": "#fabed4",
            "PHONE": "#aaffc3"
        }
    },
    "keymap_by_label": {"NAME": "q", "PHONE": "e"},
    "keymap": {"accept":["d"]}
}

Thanks