Hello, I'm using Prodigy 1.11.7, and I'm having problems with duplicate examples over sessions (not within sessions) in the ner-manual recipe. I thought that setting "feed_overlap": False would fix the issue, but that didn't work.
We have a batch size of 50, and the team regularly saves their work approximately every 50 examples. Each of them enters Prodigy with ?session=session_name.
My colleague suggests that this might be a bug here but that it could be caused by labelling very quickly. Can you confirm that's not the case for you?
He also mentioned here that folks could try installing the new alpha version to try out a fix. Have you tried that?
Hi @raulsperoni, if you're seeing this issue in Prodigy v1.11.7, I'd recommend lowering your batch size to 10. Is there a significant reason why it's set to 50? If you're just loading examples from a file, like the examples do, a batch size of 10 should work fine.
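For reference, here is a minimal sketch of the two settings discussed so far, written out as they would appear in a prodigy.json. Only the keys "batch_size" and "feed_overlap" and their values come from this thread; the helper name `render_config` is hypothetical and only there to show that Python's `False` is written as lowercase `false` in the JSON file.

```python
import json

# Settings named in this thread, with the values suggested above.
# Any other keys your prodigy.json already has would stay as they are.
config = {
    "batch_size": 10,       # lowered from 50, as recommended for v1.11.7
    "feed_overlap": False,  # Python False is serialized as JSON false
}


def render_config(settings):
    """Hypothetical helper: return the settings as prodigy.json-style text."""
    return json.dumps(settings, indent=2)


print(render_config(config))
```

The printed text can be merged into an existing prodigy.json by hand. If you start the server from Python instead, the same keys would go wherever your script currently passes its config.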
Are you able to share any more info on what error you're seeing when starting the alpha version with "experimental_feed": true? Please share in this thread to keep issues contained there: Duplicated examples over sessions in NER manual
Hello @kab, thanks for answering. The reason is that I'm loading examples from the database and saving the results there too. A batch size lower than 50 seemed too expensive. Do you think this is the reason for this bug? We've worked with the same strategy in classification problems and this does not happen at all, while in NER it happens consistently. Given enough time, every session tags the same example.
ner | Added dataset test_1 to database SQLite.
ner | Traceback (most recent call last):
ner | File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
ner | return _run_code(code, main_globals, None,
ner | File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
ner | exec(code, run_globals)
ner | File "/app/ner_recipe.py", line 147, in <module>
ner | prodigy.serve(
ner | File "/usr/local/lib/python3.8/site-packages/prodigy/__init__.py", line 49, in serve
ner | controller = loaded_recipe(*recipe_args, config=config)
ner | File "cython_src/prodigy/core.pyx", line 432, in prodigy.core.recipe.recipe_decorator.recipe_proxy
ner | File "cython_src/prodigy/core.pyx", line 78, in prodigy.core.Controller.from_components
ner | File "cython_src/prodigy/core.pyx", line 181, in prodigy.core.Controller.__init__
ner | File "cython_src/prodigy/components/feed_v2.pyx", line 139, in prodigy.components.feed_v2.Feed.__init__
ner | TypeError: add_dataset() got an unexpected keyword argument 'feed'
I can confirm that this problem is not limited to NER as I thought. I'm also seeing this behaviour in a single-choice classification task. As I mentioned, this is becoming quite a performance problem for us. Have you seen this before? What other things can we try to solve it?
What database are you using for the alpha version with "experimental_feed": true? I just tried a clean install of version 1.11.8a2 and didn't run into any issues. Do you have a full prodigy.json file you can share?
For 1.11.7 (or 1.11.8a2 with "experimental_feed": false since it'll run the same code) we don't have a concrete fix for the duplicate examples issue. My only recommendation would be decreasing the batch_size. We're planning to fix forward and move users to the new Feed and Database implementations as soon as possible.
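In the meantime, it may help to measure how often the duplication actually happens. The following is only a rough diagnostic sketch, not a fix: it assumes you have exported the annotations as a list of dicts (for example from a JSONL export of the dataset) and that each saved example carries the `_task_hash` and `_session_id` fields. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def find_cross_session_duplicates(examples):
    """Map each task hash to the sessions that annotated it, keeping only
    hashes that show up in more than one session."""
    sessions_by_task = defaultdict(set)
    for eg in examples:
        # str() keeps sorting safe if a record has no session id.
        sessions_by_task[eg["_task_hash"]].add(str(eg.get("_session_id")))
    return {
        task_hash: sorted(sessions)
        for task_hash, sessions in sessions_by_task.items()
        if len(sessions) > 1
    }


# Hypothetical sample records standing in for exported annotations.
annotations = [
    {"_task_hash": 101, "_session_id": "test_1-alice"},
    {"_task_hash": 102, "_session_id": "test_1-alice"},
    {"_task_hash": 101, "_session_id": "test_1-bob"},
    {"_task_hash": 103, "_session_id": "test_1-bob"},
]

duplicates = find_cross_session_duplicates(annotations)
print(duplicates)
```

Running this before and after lowering the batch size would show whether the change reduces the number of examples annotated by more than one session.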