Duplicated examples over sessions in NER manual

Hello, I'm using Prodigy 1.11.7, and I'm having problems with duplicate examples across sessions (not within sessions) in the ner-manual recipe. I thought that setting "feed_overlap": False would fix the issue, but it didn't.

I've read this: Duplicate annotations in output - #46 by brdlyrbrts but I'm still not sure what to do.

We have a batch size of 50, and the team regularly saves their work every 50 examples approx. Each one of them enters prodigy with ?session=session_name

Could anyone help?
Thank you!

My colleague suggests that this might be a bug here but that it could be caused by labelling very quickly. Can you confirm that's not the case for you?

He also mentioned here that folks could try installing the new alpha version to try out a fix. Have you tried that?

pip install prodigy==1.11.8a2 --extra-index-url https://{YOUR_LICENSE_KEY}@download.prodi.gy

@koaning , I don't think this is our problem

I tried installing that version and starting prodigy with:

prodigy_config = {
    "host": "0.0.0.0",
    "show_stats": True,
    "highlight_chars": True,
    "buttons": ["accept", "ignore", "undo"],
    "auto_count_stream": True,
    "feed_overlap": False,
    "batch_size": 50,
    "force_stream_order": True,
}

but I still see the same problem. The same example is being fed to all sessions, which is slowing us down badly.

This option (which is mentioned in the thread) didn't allow prodigy to start correctly:

"experimental_feed": true

Thank you for your help

Hi @raulsperoni, if you're seeing this issue in Prodigy v1.11.7, I'd recommend lowering your batch size to 10. Is there a significant reason why it's set to 50? If you're just loading examples from a file, as in the examples, 10 should work fine.

Are you able to share any more info on what error you're seeing when starting the alpha version with "experimental_feed": true? Please share in this thread to keep issues contained there: Duplicated examples over sessions in NER manual

Hello @kab, thanks for answering. The reason is that I'm loading examples from the database and saving the results there too. Anything lower than 50 seemed too expensive. Do you think this is the reason for this bug? We've worked with the same strategy in classification problems and this does not happen at all, while in NER it happens consistently. Given enough time, every session tags the same example.
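To put a number on how often the same example reaches more than one session, one option is to group the saved annotations by task hash and look at which sessions produced them. The following is only a minimal sketch: it assumes the annotations have already been loaded as a list of dicts (for example from a JSONL export of the dataset) and that each record carries the "_task_hash" and "_session_id" keys. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def cross_session_duplicates(examples):
    """Return {task_hash: set_of_session_ids} for every task hash that
    was annotated in more than one session."""
    sessions_by_task = defaultdict(set)
    for eg in examples:
        # Assumption: every saved annotation has "_task_hash" and
        # "_session_id"; records without a session are grouped under None.
        sessions_by_task[eg["_task_hash"]].add(eg.get("_session_id"))
    return {h: s for h, s in sessions_by_task.items() if len(s) > 1}


# Hypothetical records, shaped like exported annotations.
annotations = [
    {"text": "first example", "_task_hash": 111, "_session_id": "ner-alice"},
    {"text": "first example", "_task_hash": 111, "_session_id": "ner-bob"},
    {"text": "second example", "_task_hash": 222, "_session_id": "ner-alice"},
]

dupes = cross_session_duplicates(annotations)
for task_hash, sessions in dupes.items():
    print(task_hash, sorted(sessions))
```

Grouping on "_input_hash" instead would also catch cases where the same text was sent out with different pre-set spans or options.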

I will get back to you with details.

Thanks.

Here's the output:

ner  | Added dataset test_1 to database SQLite.
ner  | Traceback (most recent call last):
ner  |   File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
ner  |     return _run_code(code, main_globals, None,
ner  |   File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
ner  |     exec(code, run_globals)
ner  |   File "/app/ner_recipe.py", line 147, in <module>
ner  |     prodigy.serve(
ner  |   File "/usr/local/lib/python3.8/site-packages/prodigy/__init__.py", line 49, in serve
ner  |     controller = loaded_recipe(*recipe_args, config=config)
ner  |   File "cython_src/prodigy/core.pyx", line 432, in prodigy.core.recipe.recipe_decorator.recipe_proxy
ner  |   File "cython_src/prodigy/core.pyx", line 78, in prodigy.core.Controller.from_components
ner  |   File "cython_src/prodigy/core.pyx", line 181, in prodigy.core.Controller.__init__
ner  |   File "cython_src/prodigy/components/feed_v2.pyx", line 139, in prodigy.components.feed_v2.Feed.__init__
ner  | TypeError: add_dataset() got an unexpected keyword argument 'feed'

Hello, @kab @koaning,

I can confirm that this problem is not limited to NER as I thought. I'm also seeing this behaviour in a single-choice classification task. As I mentioned, this is becoming quite a performance problem for us. Have you seen this before? What other things can we try to solve it?

Here's the config for classification:

prodigy_config = {
    "host": "0.0.0.0",
    "choice_style": "single",
    "choice_auto_accept": True,
    "show_stats": True,
    "feed_overlap": False,
    "buttons": ["accept", "ignore", "undo"],
    "auto_count_stream": True,
    "batch_size": 50,
}

thank you!

What database are you using for the alpha version with "experimental_feed": true? I just tried a clean install of version 1.11.8a2 and didn't run into any issues. Do you have a full prodigy.json file you can share?

For 1.11.7 (or 1.11.8a2 with "experimental_feed": false since it'll run the same code) we don't have a concrete fix for the duplicate examples issue. My only recommendation would be decreasing the batch_size. We're planning to fix forward and move users to the new Feed and Database implementations as soon as possible.
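As a sketch of that recommendation, here is the NER config posted earlier in the thread with only batch_size changed, from 50 to the value of 10 suggested above. This is not a confirmed fix; the assumption is simply that smaller batches leave fewer examples in flight per session, so fewer can end up being sent out twice.

```python
# Same keys as the NER config earlier in the thread; only batch_size differs.
prodigy_config = {
    "host": "0.0.0.0",
    "show_stats": True,
    "highlight_chars": True,
    "buttons": ["accept", "ignore", "undo"],
    "auto_count_stream": True,
    "feed_overlap": False,
    "batch_size": 10,  # was 50; smaller batches are the suggested workaround
    "force_stream_order": True,
}
```

The trade-off is more frequent round trips to the database, which was the original reason for choosing 50.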

Thanks for your patience on this issue.
