Hi All,
When I use the `--exclude` flag in `ner.correct`, I see the same 26 examples over and over. When I shut down the server and restart, I get a new set of 26 examples and can continue annotating, but restarting the annotation process every 26 examples doesn't seem like a workflow in the spirit of Prodigy. Am I doing something wrong?
In the spirit of Ines' "ingredients" NER model, I'm running a command like this (names changed to protect the innocent):
```
(prodigy-1.10.5) [~/work]$ prodigy ner.correct dataset_2 ./tmp_model prepped-data.jsonl --label FRUIT,VEG,MEAT,DAIRY,GRAIN --exclude dataset_1
```
The same command without the `--exclude` argument serves all the examples in `prepped-data.jsonl` (but of course doesn't exclude the examples in `dataset_1`).
Here's my `prodigy.json` config (most are default settings):
```json
{
  "theme": "basic",
  "custom_theme": {},
  "buttons": ["accept", "reject", "ignore", "undo"],
  "batch_size": 10,
  "history_size": 10,
  "port": 8080,
  "host": "0.0.0.0",
  "cors": true,
  "db": "sqlite",
  "db_settings": {},
  "api_keys": {},
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": false,
  "feed_overlap": false,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": false,
  "hide_meta": false,
  "show_flag": false,
  "instructions": false,
  "swipe": false,
  "split_sents_threshold": false,
  "html_template": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "input"
}
```
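For context, the behavior I'd expect from `exclude_by: "input"` is that examples are skipped when a hash of their input text already appears in the excluded dataset. Here's a rough sketch of that idea (the `md5`-based hashing here is my own simplification for illustration, not Prodigy's actual implementation):

```python
import hashlib


def input_hash(example):
    # Hash only the raw input text, the way exclude_by="input" keys off
    # the input rather than the full task (text + spans + labels).
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()


def exclude_seen(stream, annotated):
    """Yield examples from stream whose input isn't already annotated."""
    seen = {input_hash(ex) for ex in annotated}
    for ex in stream:
        if input_hash(ex) not in seen:
            yield ex
```

With this kind of filtering I'd expect the full remainder of `prepped-data.jsonl` to be served, not a fixed window of 26.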
I'm running Prodigy 1.10.5 on Linux, with Python 3.6.9.
I'm working around this by piping my data in with bash `tail`, like so:
```
(prodigy-1.10.5) [~/work]$ tail -n +100 prepped-data.jsonl | prodigy ner.correct dataset_2 ./tmp_model - -l FRUIT,VEG,MEAT,DAIRY,GRAIN
```
I increment the `-n` argument for each batch of data to correct, but this still breaks the flow a little. Is there a way to use `--exclude` and see a full stream of data using Prodigy alone?
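In case it helps anyone else, the manual incrementing can at least be scripted. A rough sketch (`demo.jsonl`, `offset.txt`, and `BATCH` are placeholder names of my own, and the batch size is just an assumption based on the 26 examples I see per session):

```shell
# Stand-in data so the sketch is self-contained; in practice this is
# the real prepped-data.jsonl.
printf '%s\n' '{"text":"a"}' '{"text":"b"}' '{"text":"c"}' > demo.jsonl

OFFSET_FILE=offset.txt
BATCH=2
# Read the saved offset, defaulting to line 1 on the first run.
OFFSET=$(cat "$OFFSET_FILE" 2>/dev/null || echo 1)
# In practice, pipe this into: prodigy ner.correct dataset_2 ./tmp_model - ...
tail -n "+$OFFSET" demo.jsonl
# Save the next start line for the following session.
echo $((OFFSET + BATCH)) > "$OFFSET_FILE"
```

Each run resumes where the last one left off, but it's still a workaround rather than a fix.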
Thanks for your help!