Duplicated examples over sessions in NER manual

raulsperoni · May 12, 2022, 2:08pm

Hello, I'm using Prodigy 1.11.7 and I'm seeing duplicate examples across sessions (not within a session) in the ner.manual recipe. I thought that setting "feed_overlap": False would fix the issue, but it didn't.

I've read this: Duplicate annotations in output - #46 by brdlyrbrts but I'm still not sure what to do.

We use a batch size of 50, and the team saves their work roughly every 50 examples. Each annotator opens Prodigy with ?session=session_name

Could anyone help?
Thank you!

koaning · May 12, 2022, 5:51pm

My colleague suggests that this might be a bug here but that it could be caused by labelling very quickly. Can you confirm that's not the case for you?

He also mentioned here that folks could try installing the new alpha version to try out a fix. Have you tried that?

pip install prodigy==1.11.8a2 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

raulsperoni · May 13, 2022, 12:47pm

@koaning, I don't think this is our problem.

I tried installing that version and starting prodigy with:

prodigy_config = {
    "host": "0.0.0.0",
    "show_stats": True,
    "highlight_chars": True,
    "buttons": ["accept", "ignore", "undo"],
    "auto_count_stream": True,
    "feed_overlap": False,
    "batch_size": 50,
    "force_stream_order": True,
}

but I still see the same problem. The same example is being fed to all sessions, which is slowing us down badly.
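One way to quantify the overlap is to export the dataset and count how many sessions annotated each task. The following is only a minimal sketch: it assumes the annotations were exported as JSONL (for example with prodigy db-out) and that every record carries the _task_hash and _session_id fields. The function names and the sample rows are hypothetical and stand in for real exported data.

```python
import json
from collections import defaultdict


def sessions_per_task(lines):
    """Map each _task_hash to the set of _session_id values that annotated it."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        eg = json.loads(line)
        seen[eg["_task_hash"]].add(eg.get("_session_id"))
    return seen


def cross_session_duplicates(lines):
    """Keep only the task hashes that appear in more than one session."""
    return {h: s for h, s in sessions_per_task(lines).items() if len(s) > 1}


# Hypothetical rows standing in for the exported JSONL of dataset test_1.
sample = [
    '{"text": "a", "_task_hash": 1, "_session_id": "test_1-alice"}',
    '{"text": "a", "_task_hash": 1, "_session_id": "test_1-bob"}',
    '{"text": "b", "_task_hash": 2, "_session_id": "test_1-alice"}',
]
dupes = cross_session_duplicates(sample)
print(len(dupes), "task(s) annotated in more than one session")
```

With a real export, the same functions can be fed an open file handle instead of the sample list, since they only iterate over lines.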

This option (which is mentioned in the thread) didn't allow prodigy to start correctly:

"experimental_feed": true

Thank you for your help

kab · May 16, 2022, 10:17pm

Hi @raulsperoni, if you're seeing this issue in Prodigy v1.11.7, I'd recommend lowering your batch size to 10. Is there a significant reason why it's set to 50? If you're just loading examples from a file, a batch size of 10 should work fine.

Are you able to share any more info on what error you're seeing when starting the alpha version with "experimental_feed": true? Please share in this thread to keep issues contained there: Duplicated examples over sessions in NER manual

raulsperoni · May 17, 2022, 2:55pm

Hello @kab, thanks for answering. The reason is that I'm loading examples from the database and saving the results there too. Lower than 50 seemed too expensive. Do you think this is the reason for this bug? We've worked with the same strategy in classification problems and this does not happen at all, while in NER it happens consistently. Given enough time, every session tags the same example.

I will get back to you with details.

Thanks.

raulsperoni · May 17, 2022, 3:35pm

Here's the output:

ner  | Added dataset test_1 to database SQLite.
ner  | Traceback (most recent call last):
ner  |   File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
ner  |     return _run_code(code, main_globals, None,
ner  |   File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
ner  |     exec(code, run_globals)
ner  |   File "/app/ner_recipe.py", line 147, in <module>
ner  |     prodigy.serve(
ner  |   File "/usr/local/lib/python3.8/site-packages/prodigy/__init__.py", line 49, in serve
ner  |     controller = loaded_recipe(*recipe_args, config=config)
ner  |   File "cython_src/prodigy/core.pyx", line 432, in prodigy.core.recipe.recipe_decorator.recipe_proxy
ner  |   File "cython_src/prodigy/core.pyx", line 78, in prodigy.core.Controller.from_components
ner  |   File "cython_src/prodigy/core.pyx", line 181, in prodigy.core.Controller.__init__
ner  |   File "cython_src/prodigy/components/feed_v2.pyx", line 139, in prodigy.components.feed_v2.Feed.__init__
ner  | TypeError: add_dataset() got an unexpected keyword argument 'feed'

raulsperoni · May 19, 2022, 2:20pm

Hello, @kab @koaning,

I can confirm that this problem is not limited to NER, as I thought. I'm also seeing this behaviour in a single-choice classification task. As I mentioned, this is becoming quite a performance problem for us. Have you seen this before? What else can we try to solve it?

Here's the config for classification:

prodigy_config = {
    "host": "0.0.0.0",
    "choice_style": "single",
    "choice_auto_accept": True,
    "show_stats": True,
    "feed_overlap": False,
    "buttons": ["accept", "ignore", "undo"],
    "auto_count_stream": True,
    "batch_size": 50,
}

thank you!

kab · May 19, 2022, 10:05pm

What database are you using for the alpha version with "experimental_feed": true? I just tried a clean install of version 1.11.8a2 and didn't run into any issues. Do you have a full prodigy.json file you can share?

For 1.11.7 (or 1.11.8a2 with "experimental_feed": false since it'll run the same code) we don't have a concrete fix for the duplicate examples issue. My only recommendation would be decreasing the batch_size. We're planning to fix forward and move users to the new Feed and Database implementations as soon as possible.
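As a sketch of that recommendation, the classification config posted earlier in the thread could be started with a smaller batch. Only batch_size changes here; whether a value of 10 is affordable with a database-backed loader is an assumption that would need to be checked against the actual setup.

```python
# Same keys as the classification config earlier in the thread,
# with batch_size lowered from 50 to 10 as the suggested workaround.
prodigy_config = {
    "host": "0.0.0.0",
    "choice_style": "single",
    "choice_auto_accept": True,
    "show_stats": True,
    "feed_overlap": False,
    "buttons": ["accept", "ignore", "undo"],
    "auto_count_stream": True,
    "batch_size": 10,
}
```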

Thanks for your patience on this issue.
