Few records in the db for the same example

Dear @ryanwesslen and @koaning

Now the only problem is that sometimes not all examples are shown to the annotator, and I can confirm it is not because of deduplication etc., because it only happens from time to time. The platform says "There are no more tasks", but records are missing from the db.
If I reboot the server, the "lost" examples are shown to the annotator.

This is definitely a bug, because if I run the same experiment again, it works about 50% of the time.
Every time I run an experiment, I clean the DB.

This is a big problem for us because we are not able to use the tool. It is very important for us that all examples are annotated.

We need technical support for this issue.

Actually, that file had only 176 records, not 200. Every time I ran it, trying to annotate as fast as possible, I still got all 176 records into the database, though one time I did see that one answer was empty.

I'm sorry you're having this problem. But this is a new problem, right?

You originally mentioned the problem was duplicates and the "empty answer" issue, which I believe you've since resolved.

But now the problem is that you're missing records that are in your source/input file? And you're 100% confident it's not due to dedup?

Can you provide details of your experiments so we can reproduce them? For example, can we have some of the exact data and the command/recipe/prodigy.json (if different from above)? Which specific records are missing, and under what circumstances (e.g., did you have only one annotator or multiple annotators? Were you using named multi-user sessions? Were those named sessions created on the fly or set in advance via PRODIGY_ALLOWED_SESSIONS?).

Have you also been keeping the logs (I'd recommend verbose logging) from the moments you see data loss (e.g., a batch where you'd expect 10 records but instead only see X)? You can even add custom logging to your Prodigy recipe, as in the sketch below.
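For example, here's a minimal sketch of what that could look like, assuming a custom JSONL-based recipe (the `mt-eval` name and the logging setup are just my illustration, not part of your setup; `prodigy.recipe` and the `JSONL` loader are standard):

```python
import logging

import prodigy
from prodigy.components.loaders import JSONL

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("mt-eval")  # hypothetical logger name


def logged_stream(stream):
    # Log every task as it is handed to the feed, so a missing record
    # shows up as a gap in the log rather than a silent loss.
    for i, eg in enumerate(stream, start=1):
        log.info("sending task %d, meta=%s", i, eg.get("meta"))
        yield eg


@prodigy.recipe("mt-eval")  # hypothetical recipe name
def mt_eval(dataset: str, source: str):
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "stream": logged_stream(stream),
        "view_id": "text",  # assumes each task has a "text" field
    }
```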

If you don't know exactly which records are missing, another debugging tip: I'd recommend you keep meta on (i.e., in your prodigy.json set "hide_meta": false, which is the default) and number each of your records, as in the file I provided you. This is extremely helpful because the annotator can see each record's exact position in the bottom right (the meta field). If you see a dup or a missed example, you know immediately when it happened and which one is missing. That's also where having a snippet of the logs from that exact point would be critical. A small preprocessing sketch is below.
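Something like this would do the numbering (the filenames are placeholders):

```python
import json

# Number each record so the annotator can see its position in the
# bottom-right meta area of the UI and spot gaps immediately.
with open("examples.jsonl", encoding="utf8") as f_in, \
        open("examples_numbered.jsonl", "w", encoding="utf8") as f_out:
    for i, line in enumerate(f_in, start=1):
        record = json.loads(line)
        record.setdefault("meta", {})["n"] = i
        f_out.write(json.dumps(record, ensure_ascii=False) + "\n")
```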

Today's a holiday and most of the team is off, but I'll try to raise your issue tomorrow. Any additional details would be incredibly helpful to us.

Also, another thought: in your prodigy.json you modified a lot of the configuration settings. Are you getting the same problem when running with Prodigy's defaults?

For example, you set "auto_exclude_current": false (your prodigy.json actually has it twice) when the default is true. The same goes for "force_stream_order": true (I think the default is false). What was your thinking behind these? What happens if you remove them (i.e., use Prodigy's defaults instead)?
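If it's useful, here's a quick sanity-check sketch for spotting such overrides (the listed defaults are just the two values mentioned above, not a complete set):

```python
import json

# The two defaults discussed above; extend as needed.
DEFAULTS = {"auto_exclude_current": True, "force_stream_order": False}

with open("prodigy.json", encoding="utf8") as f:
    # Note: if a key appears twice in the file, json.load silently
    # keeps only the last occurrence.
    config = json.load(f)

for key, default in DEFAULTS.items():
    if config.get(key, default) != default:
        print(f"{key} is overridden: {config[key]!r} (default: {default!r})")
```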

Yes @ryanwesslen, this actually is one of the issues we have had from the beginning.
To give you some context: we use Prodigy to evaluate our machine translation engines, so it is very important not to lose any of the examples.

The sessions are created on the fly, but in any case we only have one evaluator per task.

Yes, it is a different problem. Sorry for mixing issues.
I am 100% sure, because sometimes it works fine with exactly the same examples and the same recipe.
I already shared the recipe and prodigy.json.

We know which records are missing because we enumerate them.

I will try, but I don't think that is the reason for the lost examples.

We need to do the evaluation in order, so that value is correct for our use case.

I think Prodigy should not allow more than one connection to the same evaluation session, or should at least give the option to limit it, since that could be the origin of all these problems.

We have to assume that the annotator may not realize that they already have an open tab in the browser.

Hi @zparcheta!

Thanks again for your feedback.

Thanks to a teammate, I realized that force_stream_order has actually been deprecated and is now the default behavior.

The force_stream_order config setting has been deprecated since the v1.11.0 release (August 2021) and is now the default behavior of the feeds: batches are always sent and re-sent in the same order wherever possible.

This shouldn't be the reason you're having the problem, but it's worth mentioning that the setting is no longer in use.

This is an interesting point. On the one hand, restricting behavior like this would definitely simplify possible issues and prevent annotators from accidentally causing problems by opening unnecessary tabs/connections. On the other hand, we think a lot of users would not like such restrictions (especially by default). We tend to assume the user knows what's best, so we err on the side of flexibility rather than restriction. However, we're always rethinking our options, so we appreciate you bringing this topic up.

@ryanwesslen As you said, as an optional configuration it could solve some problems, especially in evaluation tasks.