Feed Overlapping Issues.

axsmarine · October 11, 2023, 8:05am

Hello Prodigy Team, Alex from AXSMARINE here.

I wanted to discuss an issue we've been encountering when distributing Prodigy tasks among our annotators and share the options we've explored thus far. We use named sessions (for example "/?session=alex") to distribute the workload.

Currently, we've been using the following configuration settings:

Prodigy version: v1.11.11

Options used in prodigy.json:

{
  "card_css": {
    "textAlign": "left",
    "fontSize": 16
  },
  "custom_theme": {
    "cardMaxWidth": 1200
  },
  "force_stream_order": true,
  "feed_overlap": false,
  "allow_work_stealing": false,
  "db_settings": {
    "mysql": {
      "db": "my_db"
    }
  }
}

Recipes used:

"ner.manual"
"spans.manual"
"textcat.manual"

We deploy the Prodigy instance in our Kubernetes cluster.

Problem description:

While these settings initially worked as intended, we have observed occasional task overlaps if Prodigy remains active for an extended period. We've been addressing this issue by having our labeling operators cross-reference the document IDs that are provided in the metadata.

It appears that overlaps tend to occur when an annotator steps away from their workstation for an extended period of time, particularly at the end of the workday. Subsequently, on the following day, we often encounter several documents that have already been labeled, creating redundancy.

Interestingly, we've noticed that the recurring documents typically have the same IDs, suggesting that they are cycling repetitively in a seemingly random manner.
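The manual cross-referencing of document IDs described above could be automated along these lines. This is only a minimal sketch: it assumes the annotations have been exported to a list of dicts, that each example carries its document ID under a hypothetical "doc_id" key in "meta", and that the annotating session is recorded under "_session_id". Adjust the names to match your own data.

```python
from collections import defaultdict


def find_repeated_doc_ids(examples, id_key="doc_id"):
    """Map each document ID seen more than once to the sessions that saw it.

    `id_key` is a hypothetical metadata field name; change it to whatever
    key your stream actually uses for the document ID.
    """
    sessions_by_doc = defaultdict(list)
    for eg in examples:
        doc_id = eg.get("meta", {}).get(id_key)
        if doc_id is None:
            continue  # skip examples without the assumed metadata field
        sessions_by_doc[doc_id].append(eg.get("_session_id", "unknown"))
    # Keep only the IDs that were annotated more than once.
    return {
        doc_id: sessions
        for doc_id, sessions in sessions_by_doc.items()
        if len(sessions) > 1
    }


# Made-up examples for illustration only.
examples = [
    {"text": "a", "meta": {"doc_id": 101}, "_session_id": "my_dataset-alex"},
    {"text": "b", "meta": {"doc_id": 102}, "_session_id": "my_dataset-alex"},
    {"text": "a", "meta": {"doc_id": 101}, "_session_id": "my_dataset-maria"},
]
repeated = find_repeated_doc_ids(examples)
print(repeated)
```

Run periodically against the saved annotations, a check like this would show which IDs recur and which sessions received them.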

Restarting the Prodigy instance temporarily resolves this problem, but we are looking for a more sustainable solution to prevent feed overlaps from recurring.

I would greatly appreciate any suggestions or recommendations you may have to mitigate this issue and ensure smoother task allocation and management.

Best regards.

ryanwesslen · October 11, 2023, 11:47am

hi @axsmarine!

Thanks for your message and welcome to the Prodigy community!

Have you seen my post on duplicate or missing data?

One thing I suspect it could be is work stealing. I noticed you have that in your config ("allow_work_stealing": false,) but that wasn't implemented until v1.12.0. Since you're using v1.11.11, can you upgrade to v1.12.7 and retry?

But be aware: although work stealing may sound like something you don't want, it is a preventive mechanism to avoid losing records in a stream, which can be much worse than duplicates!

The docs explain this well:

Without work stealing you might be able to guarantee annotations occur at most once while accepting losing a few examples in the process. By enabling work stealing you ensure all examples will be annotated in the data stream at least once.

Unfortunately, the best answer is better training for your annotators: save annotations and close the browser window when they're done.

One other point (it won't contribute to your issue): "force_stream_order": true won't do anything, as it was deprecated in v1.11.0.
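The two version notes in this reply can be turned into a quick sanity check on a config. The sketch below is not part of Prodigy; the function names are hypothetical, and the only version facts it encodes are the ones stated in this thread (work stealing settings arrive in v1.12.0, force_stream_order is deprecated as of v1.11.0).

```python
def parse_version(version):
    """Turn a version string such as '1.11.11' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def config_warnings(config, prodigy_version):
    """Return notes about settings that have no effect on the given version.

    Hypothetical helper: it only knows the two facts mentioned in this thread.
    """
    version = parse_version(prodigy_version)
    warnings = []
    if "allow_work_stealing" in config and version < (1, 12, 0):
        warnings.append("allow_work_stealing has no effect before v1.12.0")
    if "force_stream_order" in config and version >= (1, 11, 0):
        warnings.append("force_stream_order is deprecated since v1.11.0")
    return warnings


# The relevant settings from the prodigy.json posted above.
config = {
    "force_stream_order": True,
    "feed_overlap": False,
    "allow_work_stealing": False,
}
old = config_warnings(config, "1.11.11")
new = config_warnings(config, "1.12.7")
print(old)
print(new)
```

On v1.11.11 both settings are flagged; after an upgrade to v1.12.7 only the deprecated force_stream_order remains flagged.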

Hope this helps!

axsmarine · October 11, 2023, 1:01pm

Hi Ryan and thanks for the fast response!

Yes, I've read your answer on "allow_work_stealing"; I did not notice that it was only implemented in v1.12.0.

We will for sure test it out by upgrading our Prodigy.

Work stealing is not an issue for us because we have created a procedure that locates unlabeled files and includes them in the next labeling dataset.

On the force_stream_order note, I have read in the docs here that it may help to some degree, but now I see we don't need it anymore.
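A procedure of this kind can be as simple as a set difference between the source document IDs and the IDs already present in the annotations. This is only an illustrative sketch, not our actual procedure: the function name and the "doc_id" metadata key are hypothetical.

```python
def unlabeled_ids(source_ids, annotated_examples, id_key="doc_id"):
    """Return the source IDs with no saved annotation, in source order.

    `id_key` is a hypothetical metadata field name for the document ID.
    """
    annotated = {
        eg["meta"][id_key]
        for eg in annotated_examples
        if id_key in eg.get("meta", {})
    }
    return [doc_id for doc_id in source_ids if doc_id not in annotated]


# Made-up data for illustration only.
source_ids = [101, 102, 103, 104]
annotated_examples = [
    {"text": "a", "meta": {"doc_id": 101}},
    {"text": "c", "meta": {"doc_id": 103}},
]
missing = unlabeled_ids(source_ids, annotated_examples)
print(missing)
```

The resulting IDs are the ones that would go into the next labeling dataset.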

I will give feedback soon!

axsmarine · October 16, 2023, 12:02pm

Hi again @ryanwesslen

We looked for the specified version, Prodigy 1.12.7, but the download links have expired.

Notification: "Access to this file has expired. Contact the merchant for access."

Can you resend the download links to us? Is there any action or information you need from our end to make this happen?

ryanwesslen · October 16, 2023, 12:27pm

hi @axsmarine,

We've moved to a newer platform to distribute wheel files. We sent several emails about this update this summer.

Go to https://download.prodi.gy and then when prompted, use your license key as the Username. This website now provides you all wheel files you're eligible for. The previous links were through the sales vendor, SendOwl, which only allowed you to download the most recent.

You can also still install Prodigy via pypi instead.

axsmarine · October 30, 2023, 12:09pm

Hello @ryanwesslen ,

After roughly two weeks of testing, we are happy to say that our overlap issues were fixed after upgrading to Prodigy 1.12.7 and setting the config option "allow_work_stealing": false.

We are very grateful for your help!

We also have a quick question regarding our database schema:
After moving to Prodigy version 1.12.7, we noticed the automatic creation of three new tables:
"structured_example", "structured_input", "structured_link"

Are these new tables essential for Prodigy, and if not, can we remove them, since no data is written to them?
We are still only using the previous tables: "dataset", "example", "link"

If you can provide some documentation link for their purpose I would be happy to explore it because I could not find any information about them.

ryanwesslen · October 31, 2023, 4:14pm

hi @axsmarine,

Sorry for the delay. I had to check on this with one of our Prodigy dev leads.

You can delete them, but they'll be recreated automatically the next time you run Prodigy. We'll look into making a patch that removes them from the list of tables to auto-create if they don't exist, but that'll be in a new version. So unfortunately you'll have to live with them for now.

We don't have documentation for them because these tables are part of a longer-term plan to introduce structured examples down the road.

axsmarine · November 2, 2023, 9:09am

Hi @ryanwesslen,

Thank you for the clarification, we will deal with them manually.
