Missing first N annotations when using ner.manual recipe

First, thanks for the amazing product!

I am using the ner.manual recipe on pre-annotated data to annotate spans such as DURATION. An example row:

{
  "text": "He called yesterday",
  "tokens": [
    {
      "id": 0,
      "start": 0,
      "end": 2,
      "text": "He"
    },
    {
      "id": 1,
      "start": 3,
      "end": 9,
      "text": "called"
    },
    {
      "id": 2,
      "start": 10,
      "end": 19,
      "text": "yesterday"
    }
  ],
  "answer": "accept",
  "spans": [
    {
      "label": "DURATION",
      "start": 10,
      "end": 19,
      "token_start": 2,
      "token_end": 2
    }
  ]
}
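
A quick consistency check on rows like the one above can catch misaligned tokens or spans before they reach the annotator. This is only a sketch: `check_task` is a hypothetical helper, not part of Prodigy, and it assumes that token ids are meant to equal their position in the list (which is what `token_start` and `token_end` suggest).

```python
def check_task(task):
    """Return a list of human-readable problems found in one pre-annotated task."""
    problems = []
    text = task["text"]
    tokens = task.get("tokens", [])
    for index, token in enumerate(tokens):
        # Assumption: a token's id should equal its position in the list.
        if token["id"] != index:
            problems.append(f"token {index} has id {token['id']}")
        # The character offsets should slice out exactly the token text.
        if text[token["start"]:token["end"]] != token["text"]:
            problems.append(f"token {index} offsets do not match its text")
    for span in task.get("spans", []):
        first, last = span["token_start"], span["token_end"]
        if not (0 <= first <= last < len(tokens)):
            problems.append(f"span {span['label']} has token indices out of range")
            continue
        # A span should start and end exactly on token boundaries.
        if span["start"] != tokens[first]["start"] or span["end"] != tokens[last]["end"]:
            problems.append(f"span {span['label']} is not aligned to token boundaries")
    return problems


task = {
    "text": "He called yesterday",
    "tokens": [
        {"id": 0, "start": 0, "end": 2, "text": "He"},
        {"id": 1, "start": 3, "end": 9, "text": "called"},
        {"id": 2, "start": 10, "end": 19, "text": "yesterday"},
    ],
    "spans": [
        {"label": "DURATION", "start": 10, "end": 19, "token_start": 2, "token_end": 2}
    ],
}
problems = check_task(task)
print(problems or "task looks consistent")
```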

Pre-generating the annotations speeds up the corpus creation process for the annotators.
However, I have observed that after each annotation round, the first N annotations go missing. Oddly, around 1 in 10 annotations goes missing, and it always seems to be the first N samples presented to the annotator.

Some additional info:

  • these samples are not duplicates of any other sample in the input or DB
  • I run Prodigy in a Docker container
  • I can’t reproduce the issue when I run the Docker container on my own macOS machine, but it happens consistently on the annotator’s macOS machine (also using a Docker container)

Any idea what may be going on and whether that’s likely to do with Prodigy?

Hi! Thanks for the report and the very detailed analysis :+1:

Is it possible that serving Prodigy in your Docker environment somehow causes the app to be loaded twice or refreshed automatically, or something like that? When you load the Prodigy app, it’ll make a request to /get_questions to fetch the next batch of questions. If you reload the window or another person accesses the app from somewhere else, the next request to /get_questions will fetch the next batch of examples (by default, the batch size is 10). So if something already consumes the first batch, this could explain why your annotator is only seeing the second batch.

From Prodigy’s perspective, it only knows that it sent out that batch and hasn’t received the answers yet – but it can’t know that they’re not coming back. (Maybe it went out to a different person, maybe the user is currently offline etc.) So when the next batch is requested, it sends out the next batch.
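
To make the effect concrete, here is a toy model of that behaviour. It is not Prodigy's actual code: `make_server` and `get_questions` are made-up names that only mimic a server handing out the next batch of 10 on every request.

```python
from itertools import islice

BATCH_SIZE = 10  # the default batch size mentioned above


def make_server(stream):
    """Toy stand-in for the server: every call hands out the next batch."""
    iterator = iter(stream)

    def get_questions():
        # The server has no way of knowing whether the previous batch
        # will ever be answered, so it simply moves on.
        return list(islice(iterator, BATCH_SIZE))

    return get_questions


get_questions = make_server(range(30))  # 30 example ids: 0..29

lost_batch = get_questions()  # first page load, never answered
seen_batch = get_questions()  # after a reload: what the annotator works on

print("never answered:", lost_batch)
print("annotated:", seen_batch)
```

In this model the first ten examples are simply never shown again within the same session, which matches the "first N annotations missing" symptom.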

This sounds like a very good explanation to me! So if I stop the current annotation process and restart it, these missing annotations would get picked up (given that I use the same DB, of course)? That would be an easy experiment to conduct. Thanks! I’ll see if I can test this hypothesis with my annotator. :slight_smile:

Yes, exactly – and, of course, given that there’s nothing in the setup that just automatically reloads the app (because in that case, the annotator would just always get to see the second batch, no matter what).

If you aren’t doing this already, you can also set the environment variable PRODIGY_LOGGING=basic. This will log everything that’s going on under the hood, including the API requests and responses. If our hypothesis is correct, you should be seeing Prodigy respond to /get_questions twice after startup (and before it has received anything back from /give_answers).

If the underlying problem turns out to be difficult to debug, you could also work around it by making your stream infinite. On each iteration, you can then load the data and send out examples whose hashes aren’t yet in the dataset. So if a batch is dropped in the first iteration (for whatever reason), it’ll be queued up again in the second. If an iteration doesn’t send out any un-annotated examples anymore, you can break and you know that no examples are missing. You can find more details and an example implementation here: "No tasks available" on page refresh
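
The idea can be sketched in plain Python roughly as follows. Everything here is an assumption for illustration: `task_hash` is a stand-in for whatever hash identifies a task, and `load_examples` and `get_annotated_hashes` are hypothetical callables that you would back with your input file and your dataset.

```python
import hashlib


def task_hash(task):
    """Hypothetical stand-in for a task hash: a digest of the text."""
    return hashlib.sha1(task["text"].encode("utf8")).hexdigest()


def infinite_stream(load_examples, get_annotated_hashes):
    """Keep re-sending examples until every one has an answer in the dataset."""
    while True:
        # Re-read the answered hashes at the start of every pass.
        annotated = set(get_annotated_hashes())
        sent = 0
        for task in load_examples():
            if task_hash(task) not in annotated:
                sent += 1
                yield task
        if sent == 0:
            # A full pass found nothing left to annotate, so nothing is missing.
            break


# Simulated round: five examples, two of which are dropped in the first pass.
examples = [{"text": f"example {i}"} for i in range(5)]
answered = set()

stream = infinite_stream(lambda: examples, lambda: answered)
first_pass = [next(stream) for _ in range(5)]
for task in first_pass[:3]:  # only three answers make it back
    answered.add(task_hash(task))

second_pass = [next(stream) for _ in range(2)]  # the dropped ones come back
for task in second_pass:
    answered.add(task_hash(task))

remaining = list(stream)  # empty: the stream ends once everything is answered
print([t["text"] for t in second_pass], remaining)
```

One caveat with this approach: an example whose answer has not been saved yet when the next pass starts would be sent out a second time, so answers should be saved before the stream loops around.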

Thank you, Ines! I think the refreshing was the issue. I did another annotation round, explicitly asking the annotator not to refresh, and all annotations are there except for 1, which was likely skipped by the annotator. I will enable logging next time, though, just in case.
I also like the idea of using an infinite stream. I could implement this solution just to be sure no data gets lost. I assume skipped samples receive some sort of mark and are not presented again?

Thanks for the update – that’s good to know! :+1:

By skipping you mean the user clicking the “ignore” button, right? If so, then yes – under the hood, the “ignore” decision is treated just like “accept” and “reject”. So it’s just another answer, and the task will be sent back with "answer": "ignore". Ignored examples will be automatically excluded from training etc., but they’re still present in the dataset (because it’s still useful information and you might want to go over all ignored examples again at a later point). TL;DR: Ignoring an example counts as having annotated it.
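
If you export the dataset and want to look at the ignored examples separately, a small filter along these lines would do. It is just a sketch over a list of task dicts; `partition_answers` is a made-up helper name, and the only thing it relies on is the "answer" key described above.

```python
def partition_answers(examples):
    """Split annotated tasks into those with a decision and those ignored."""
    kept = [eg for eg in examples if eg.get("answer") in ("accept", "reject")]
    ignored = [eg for eg in examples if eg.get("answer") == "ignore"]
    return kept, ignored


annotated = [
    {"text": "He called yesterday", "answer": "accept"},
    {"text": "She left", "answer": "ignore"},
    {"text": "It took two hours", "answer": "reject"},
]
kept, ignored = partition_answers(annotated)
print(len(kept), "with a decision,", len(ignored), "ignored")
```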

Great, thanks!!