Few records in the db for the same example

hi @zparcheta!

I'm stepping in here because Vincent is juggling a lot.

I'm still catching up, but is the root of your problem that you saw some duplicate records while annotating?

I'm not sure I understand what you mean by the "first example" not being saved in the DB, or whether this is a critical problem you're trying to solve.

And as you mention here, you noticed it once, but not again? Any chance that these duplicates tend to be near the end of your stream?

Also, I noticed that your `prodigy.json` keeps the default of `feed_overlap: false`.
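For reference, that default looks like this in `prodigy.json` (a minimal fragment; your other settings are omitted here). With `feed_overlap: false`, each example is sent to whichever annotator requests it first, rather than the full stream being shown to every annotator:

```json
{
  "feed_overlap": false
}
```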

This sounds like it could be work stealing. I just wrote up a detailed response collecting a lot of details on this and why it's actually a preventive measure to avoid an alternative problem: examples getting dropped. We have lots of plans in the works to provide alternative options (e.g., task routing, and in v2, a complete overhaul of our stream generator that would eliminate the need for work stealing).

It should be noted, though, that a small number of duplicates is still expected in multi-user workflows with `feed_overlap` set to `false`. This is perfectly normal behavior and should only occur towards the end of the example stream. These "end-of-queue" duplicates come from the work-stealing mechanism in the internal Prodigy feed. Work stealing is a preventive mechanism that keeps examples in the stream from being lost when an annotator requests a batch of examples (effectively locking those examples) and then never annotates them. It allows annotators who reach the end of a shared stream to annotate these otherwise locked examples that other annotators are holding on to. Essentially, we have prioritized annotating every example in your data stream *at least* once over annotating each *at most* once but potentially losing a few.
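To make the mechanism concrete, here's a toy sketch of the work-stealing idea (this is not Prodigy's actual code; the names and the batch size of 3 are made up for illustration). Each session locks a batch when it requests work, and a session that drains the shared stream "steals" examples another session has locked but not yet saved, so nothing is dropped, at the cost of an occasional duplicate near the end:

```python
from collections import deque

def run_session(stream, locked, annotated):
    """One annotator requests a batch of 3, but only saves part of it."""
    batch = [stream.popleft() for _ in range(min(3, len(stream)))]
    locked.extend(batch)  # requesting a batch effectively locks it
    for ex in batch[:2]:  # simulate saving only the first two examples
        annotated.append(ex)
        locked.remove(ex)

stream = deque(range(6))
locked, annotated = [], []
run_session(stream, locked, annotated)  # session A holds example 2 unsaved
run_session(stream, locked, annotated)  # session B holds example 5 unsaved

# A session that reaches the end of the stream steals the locked
# examples instead of sitting idle, so every example is annotated at
# least once -- possibly twice if the original holder saves later.
while not stream and locked:
    annotated.append(locked.pop(0))

assert sorted(annotated) == [0, 1, 2, 3, 4, 5]
```

The key trade-off is visible in the last loop: without it, examples 2 and 5 would be lost forever if their holders walked away; with it, they might get annotated twice.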

If it is work stealing, your best tactic is probably to remind your annotators to save their annotations when they're done and not keep a browser tab open indefinitely. Another option that will reduce the chance of duplicates is lowering `batch_size` to 1. However, this has the trade-off that users can't go back and modify their last example, since accepted records are immediately saved to the database.
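If you want to try that, a minimal override in your `prodigy.json` would look like this (just the relevant key; keep your other settings as they are):

```json
{
  "batch_size": 1
}
```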

Does this make sense?