Feed Overlapping Issues.

Hello Prodigy Team, Alex from AXSMARINE here.

I wanted to discuss an issue we've been encountering when distributing Prodigy tasks among our annotators and share the options we've explored thus far. We use named sessions (e.g. "/?session=alex") to distribute the workload.

Currently, we've been using the following configuration settings:

Prodigy version: v1.11.11

Options used in prodigy.json:

{
  "card_css": {
    "textAlign": "left",
    "fontSize": 16
  },
  "custom_theme": {
    "cardMaxWidth": 1200
  },
  "force_stream_order": true,
  "feed_overlap": false,
  "allow_work_stealing": false,
  "db_settings": {
    "mysql": {
      "db": "my_db"
    }
  }
}

Recipes used:

"ner.manual"
"spans.manual"
"textcat.manual"

We deploy the Prodigy instance in our Kubernetes cluster.

Problem description:

While these settings initially worked as intended, we have observed occasional task overlaps if Prodigy remains active for an extended period. We've been addressing this issue by having our labeling operators cross-reference the document IDs that are provided in the metadata.
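The manual cross-referencing described above could be automated along these lines. This is only a minimal sketch: it assumes the saved annotations have been exported as a list of dictionaries, that the document ID lives under a hypothetical `meta["doc_id"]` key, and that each example carries a `_session_id` field. The function name `find_overlaps` is illustrative, not part of Prodigy.

```python
from collections import defaultdict


def find_overlaps(examples, id_key="doc_id"):
    """Map each document ID seen more than once to the sessions that annotated it.

    `id_key` is a hypothetical metadata key; adjust it to the real one.
    """
    sessions_by_doc = defaultdict(list)
    for eg in examples:
        doc_id = eg.get("meta", {}).get(id_key)
        if doc_id is None:
            continue  # skip examples without an ID rather than guess
        sessions_by_doc[doc_id].append(eg.get("_session_id", "unknown"))
    # Keep only the documents that were annotated more than once.
    return {doc: s for doc, s in sessions_by_doc.items() if len(s) > 1}


if __name__ == "__main__":
    annotated = [
        {"meta": {"doc_id": "A"}, "_session_id": "my_dataset-alex"},
        {"meta": {"doc_id": "B"}, "_session_id": "my_dataset-alex"},
        {"meta": {"doc_id": "A"}, "_session_id": "my_dataset-sam"},
    ]
    print(find_overlaps(annotated))
```

Grouping by session as well as by ID makes it easier to see whether a repeat came from the same annotator returning to a stale browser tab or from a different annotator.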

It appears that overlaps tend to occur when an annotator steps away from their workstation for an extended period of time, particularly at the end of the workday. On the following day, we often encounter several documents that have already been labeled, which creates redundant work.

Interestingly, we've noticed that the recurring documents typically have the same IDs, suggesting that they are cycling repetitively in a seemingly random manner.

Restarting the Prodigy instance temporarily resolves this problem, but we are looking for a more sustainable solution to prevent feed overlaps from recurring.

I would greatly appreciate any suggestions or recommendations you may have to mitigate this issue and ensure smoother task allocation and management.

Best regards.

hi @axsmarine!

Thanks for your message and welcome to the Prodigy community :wave:

Have you seen my post on duplicate or missing data?

One thing I suspect it could be is work stealing. I noticed you have that in your config ("allow_work_stealing": false,) but that wasn't implemented until v1.12.0. Since you're using v1.11.11, can you upgrade to v1.12.7 and retry?

But be aware: although work stealing may sound like something you don't want, it is a preventive mechanism to avoid losing records in a stream, which can be much worse than duplicates!

The docs explain this well:

Without work stealing you might be able to guarantee annotations occur at most once while accepting losing a few examples in the process. By enabling work stealing you ensure all examples will be annotated in the data stream at least once.

Unfortunately, the best answer is better training for your annotators: they should save their annotations and close their browser windows when they're done.

One other point -- won't contribute to your issue -- is that "force_stream_order": true won't do anything as it was deprecated in v1.11.0.
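To make the two version notes above concrete, here is a small sketch that flags config keys that have no effect on a given Prodigy version. The version numbers are the ones stated in this thread; the helper names (`check_config`, `ADDED_IN`, `DEPRECATED_IN`) are made up for illustration and are not part of Prodigy.

```python
import json

# Version facts as stated in this thread; everything else is illustrative.
ADDED_IN = {"allow_work_stealing": (1, 12, 0)}
DEPRECATED_IN = {"force_stream_order": (1, 11, 0)}


def parse_version(text):
    """Turn 'v1.11.11' or '1.11.11' into a comparable tuple (1, 11, 11)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def format_version(version):
    """Turn (1, 12, 0) back into 'v1.12.0' for messages."""
    return "v" + ".".join(str(part) for part in version)


def check_config(config, installed_version):
    """Return warnings for keys that have no effect on the installed version."""
    installed = parse_version(installed_version)
    warnings = []
    for key in config:
        added = ADDED_IN.get(key)
        if added is not None and installed < added:
            warnings.append(f"{key}: ignored before {format_version(added)}")
        deprecated = DEPRECATED_IN.get(key)
        if deprecated is not None and installed >= deprecated:
            warnings.append(f"{key}: deprecated since {format_version(deprecated)}")
    return warnings


if __name__ == "__main__":
    config = json.loads(
        '{"force_stream_order": true, "feed_overlap": false, '
        '"allow_work_stealing": false}'
    )
    for warning in check_config(config, "v1.11.11"):
        print(warning)
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.9" would sort after "1.12".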

Hope this helps!

Hi Ryan and thanks for the fast response!

Yes, I've read your answer on "allow_work_stealing"; I had not noticed that it was only implemented in v1.12.0.

We will for sure test it out by upgrading our Prodigy.

Work stealing is not an issue for us because we have created a procedure that locates unlabeled files and includes them in the next labeling dataset.
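A procedure like this can be as simple as a set difference between the source document IDs and the IDs found in the saved annotations. The sketch below is only an assumption about how it might look: `find_unlabeled` is a hypothetical name, and the `meta["doc_id"]` key stands in for whatever ID field the metadata really uses.

```python
def find_unlabeled(source_ids, annotated_examples, id_key="doc_id"):
    """Return the source IDs that have no saved annotation, in source order.

    `id_key` is a hypothetical metadata key; adjust it to the real one.
    """
    labeled = {eg.get("meta", {}).get(id_key) for eg in annotated_examples}
    return [doc_id for doc_id in source_ids if doc_id not in labeled]


if __name__ == "__main__":
    source_ids = ["A", "B", "C"]
    annotated = [{"meta": {"doc_id": "A"}}, {"meta": {"doc_id": "C"}}]
    # IDs returned here would go into the next labeling dataset.
    print(find_unlabeled(source_ids, annotated))
```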

On the force_stream_order note, I had read in the docs that it may help to some degree, but now I see we don't need it anymore.

I will give feedback soon!

Hi again @ryanwesslen

We looked for the specified version, Prodigy 1.12.7, but the download links have expired.

Notification: "Access to this file has expired. Contact the merchant for access."

Could you resend the download links to us? Is there any action or information you need from our end to make this happen?

hi @axsmarine,

We've moved to a newer platform to distribute wheel files. We sent several emails about this update this summer.

Go to https://download.prodi.gy and then, when prompted, use your license key as the Username. This website now provides all the wheel files you're eligible for. The previous links were through the sales vendor, SendOwl, which only allowed you to download the most recent version.

You can also still install Prodigy via pypi instead.

Hello @ryanwesslen ,

After roughly two weeks of testing we are happy to say that our issues with overlapping were fixed after upgrading to Prodigy 1.12.7 and applying recipe option "allow_work_stealing": false.

We are very grateful for your help! :slight_smile:

We also have a quick question regarding our database schema:
After moving to Prodigy version 1.12.7 we noticed the automatic creation of three new tables
"structured_example", "structured_input", "structured_link"

Are these new tables essential for Prodigy, and if not, can we remove them, since no data is written to them?
We are still only using the previous tables "dataset", "example", "link"

If you can provide some documentation link for their purpose I would be happy to explore it because I could not find any information about them.
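Before touching the new tables, it may be worth confirming that they really are empty. The sketch below only illustrates the idea: it uses an in-memory SQLite database as a stand-in (the setup in this thread uses MySQL, where the same `SELECT COUNT(*)` query would go through a MySQL client instead), and `count_rows` is a hypothetical helper.

```python
import sqlite3

# Table names as reported in this thread.
NEW_TABLES = ("structured_example", "structured_input", "structured_link")


def count_rows(conn, tables):
    """Return {table: row_count}, accepting only names from the fixed allow-list."""
    counts = {}
    for table in tables:
        if table not in NEW_TABLES:
            raise ValueError(f"unexpected table name: {table}")
        # Safe to interpolate: the name was checked against the allow-list.
        cursor = conn.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cursor.fetchone()[0]
    return counts


if __name__ == "__main__":
    # Stand-in database with empty tables, for demonstration only.
    conn = sqlite3.connect(":memory:")
    for table in NEW_TABLES:
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
    print(count_rows(conn, NEW_TABLES))
```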

hi @axsmarine,

Sorry for the delay. I had to check on this with one of our Prodigy dev leads.

You can delete them, but they'll be recreated automatically the next time you run Prodigy. We'll look into making a patch that removes them from the list of tables to auto-create if they don't exist, but that'll be in a new version. So unfortunately, you may have to live with them for now.

We don't have documentation for them because these tables are part of a longer-term plan to introduce structured examples down the road.

Hi @ryanwesslen,

Thank you for the clarification, we will deal with them manually.