Repeating examples when using `--exclude` with `ner.correct`

Hi All,

When I use the --exclude flag with ner.correct, I see the same 26 examples over and over. When I shut down the server and restart it, I get a new set of 26 examples and can continue annotating, but restarting the annotation process every 26 examples doesn't seem like a workflow in the spirit of the tool. Am I doing something wrong?

In the spirit of Ines' "ingredients" NER model, I'm running a command like this (names changed to protect the innocent):

(prodigy-1.10.5) [~/work]$ prodigy ner.correct dataset_2 ./tmp_model prepped-data.jsonl --label FRUIT,VEG,MEAT,DAIRY,GRAIN --exclude dataset_1

The same command without the --exclude argument serves all the examples in prepped-data.jsonl (but of course doesn't exclude the examples in dataset_1).
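For context, here's a minimal sketch of what I understand --exclude to be doing: filtering the incoming stream against the input hashes of examples already in the excluded dataset. (This uses plain md5 over the text as a stand-in — Prodigy's real hashing normalizes more fields — so it's illustrative only.)

```python
import hashlib

def input_hash(example):
    """Hash only the raw input text, mirroring "exclude_by": "input".
    Illustrative only -- Prodigy's actual hashing is more involved."""
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()

def filter_stream(stream, excluded_hashes):
    """Yield only examples whose input hash isn't already annotated."""
    for eg in stream:
        if input_hash(eg) not in excluded_hashes:
            yield eg

# dataset_1 already contains an annotation for "fresh basil"
annotated = [{"text": "fresh basil"}]
excluded = {input_hash(eg) for eg in annotated}

stream = [{"text": "fresh basil"}, {"text": "aged cheddar"}]
remaining = list(filter_stream(stream, excluded))
print([eg["text"] for eg in remaining])  # only "aged cheddar" survives
```

So my expectation was that only previously annotated inputs get dropped, and everything else streams through.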

Here's my prodigy.json config (most are default settings):

{
  "theme": "basic",
  "custom_theme": {},
  "buttons": ["accept", "reject", "ignore", "undo"],
  "batch_size": 10,
  "history_size": 10,
  "port": 8080,
  "host": "",
  "cors": true,
  "db": "sqlite",
  "db_settings": {},
  "api_keys": {},
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": false,
  "feed_overlap": false,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": false,
  "hide_meta": false,
  "show_flag": false,
  "instructions": false,
  "swipe": false,
  "split_sents_threshold": false,
  "html_template": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "input"
}

I'm running Prodigy 1.10.5 on Linux, Python 3.6.9.

I'm working around this by piping my data in with bash's tail, like so:

(prodigy-1.10.5) [~/work]$ tail -n +100 prepped-data.jsonl | prodigy ner.correct dataset_2 ./tmp_model - -l FRUIT,VEG,MEAT,DAIRY,GRAIN

I increment the -n argument for each batch of data to correct, but this still breaks the flow a little. Is there a way to use --exclude and still see the full stream of data?
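In case it helps anyone else, the manual bookkeeping behind bumping tail's -n argument can be automated with a small script that persists the offset between sessions. (This is just a sketch with a hypothetical offset.txt file, not a Prodigy feature.)

```python
import json
import tempfile
from pathlib import Path

def next_batch(data_file, offset_file, batch_size=26):
    """Serve the next unseen slice of a JSONL file, persisting how far
    we got -- the same bookkeeping as manually bumping `tail -n +N`."""
    offset = int(offset_file.read_text()) if offset_file.exists() else 0
    lines = data_file.read_text().splitlines()
    batch = [json.loads(line) for line in lines[offset:offset + batch_size]]
    offset_file.write_text(str(offset + len(batch)))
    return batch

# Demo with a throwaway three-line dataset
tmp = Path(tempfile.mkdtemp())
data = tmp / "prepped-data.jsonl"
data.write_text("\n".join(json.dumps({"text": t}) for t in ["a", "b", "c"]))
offset = tmp / "offset.txt"

first = next_batch(data, offset, batch_size=2)   # texts "a", "b"
second = next_batch(data, offset, batch_size=2)  # text "c"
```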

Thanks for your help!

Hi! That definitely doesn't sound like the intended behaviour. Could you double-check whether the incoming examples end up with the same hashes as the examples already in the dataset that's excluded? (Quick debugging tip: if you make Prodigy execute any JavaScript, like "javascript": "console.log('js!')", you can log window.prodigy.content in your developer console to see the JSON of whatever is currently on the screen).

If the hashes are the same, it's more likely related to the stream orchestration :thinking: In that case, can you check if setting "force_stream_order": false solves the problem?

Thanks for the tips! (And for being so quick!)

I tried your debugging tip and confirmed that the hashes of the repeating examples are indeed unchanged each time they come around again on subsequent cycles.

So...setting "force_stream_order": false does indeed solve the problem!

This doesn't exactly feel like "case closed", since "force_stream_order": false is the default, and even if it weren't, the docs suggest that setting "force_stream_order": true should still register annotated examples as "finished". But it does solve my problem, so thank you! I hope it solves somebody else's problem too!

Yes, this is definitely not the "solution", but it helps narrow in on the underlying problem. (The recipe in this case sets "force_stream_order": True, so it sounds like you've hit some edge case here that makes Prodigy believe that a batch hasn't been annotated yet, even though it's present in the dataset.)
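The failure mode described above can be modelled with a toy feed (this is not Prodigy's actual feed code, just a sketch of the logic): under forced stream order, the oldest batch is re-sent until the server records its answers, so if that record-keeping silently fails, the same batch repeats forever.

```python
from collections import deque

class ForcedOrderFeed:
    """Toy model of a forced-order feed: a batch is re-served until its
    answers are registered. If registration fails, it repeats forever."""

    def __init__(self, examples, batch_size):
        self.pending = deque(
            examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)
        )

    def get_questions(self):
        # Always re-serve the oldest unanswered batch
        return list(self.pending[0]) if self.pending else []

    def receive_answers(self, batch):
        # Only when answers arrive does the feed advance
        if self.pending and self.pending[0] == batch:
            self.pending.popleft()

feed = ForcedOrderFeed([1, 2, 3, 4], batch_size=2)
first = feed.get_questions()   # [1, 2]
again = feed.get_questions()   # still [1, 2]: answers never registered
feed.receive_answers(first)
after = feed.get_questions()   # [3, 4]: the feed finally advances
```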

We were able to reproduce this bug exactly (right down to the batch of 26 repeating examples) on our setup as well.