textcat.manual Duplicate Samples

Hello, I'm currently manually annotating a collection of text using the textcat.manual recipe. I'm having two annotators assign labels to the samples in the dataset in order to calculate inter-annotator agreement, using the /?session=id feature. I've noticed that the annotators are receiving the same sample to annotate multiple times, even if they have already annotated it. On top of that, Prodigy seems to keep presenting examples as if the dataset were of infinite size: if the dataset has 100 samples, the annotation session continues past the 100th sample, and the progress bar displays an infinity symbol.

This:
[screenshot: my progress bar showing an infinity symbol]
Versus this (which can be seen in the demo on the Prodigy website):
[screenshot: the demo's progress bar]
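
For what it's worth, this is roughly how the agreement gets computed afterwards – a minimal sketch assuming two named sessions called alice and bob, a single label (so each annotation is an accept/reject answer), and Cohen's kappa as the agreement measure:

from prodigy.components.db import connect
from sklearn.metrics import cohen_kappa_score

db = connect()  # uses the database settings from prodigy.json

# Group each annotator's answers by the example's _input_hash
answers = {}
for eg in db.get_dataset("my_dataset"):
    session = eg.get("_session_id")  # e.g. "my_dataset-alice"
    answers.setdefault(session, {})[eg["_input_hash"]] = eg["answer"]

alice = answers["my_dataset-alice"]
bob = answers["my_dataset-bob"]
shared = sorted(set(alice) & set(bob))  # inputs both annotators saw
kappa = cohen_kappa_score([alice[h] for h in shared],
                          [bob[h] for h in shared])
print(f"Cohen's kappa over {len(shared)} shared examples: {kappa:.3f}")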

I'm using Prodigy 1.9.9.

This is my prodigy.json file:

{
    "port": 8081,
    "force_stream_order": true,
    "feed_overlap": true,
    "exclude_by": "input"
}
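
Each annotator then opens the app with their own named session, e.g.:

http://localhost:8081/?session=annotator_1
http://localhost:8081/?session=annotator_2

(with annotator_1 and annotator_2 standing in for the actual session names).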

I've confirmed that the duplicate annotated samples have identical values for both _input_hash and _task_hash, so I'm not sure why they are showing up multiple times in the same session.

Any ideas?

Thanks!

Hi! Thanks for the report – we're currently trying to debug a potential problem with force_stream_order over on this thread. If we can track it down, it's likely related:

Hello! Yes, after running a few tests it looks like the issue is the force_stream_order parameter; setting it to false prevents the duplicates from appearing in the stream.

OK, I ran into a second issue that may or may not be related. If you drop a dataset and then reload it, the samples that were previously annotated in it are still treated as labeled by the annotator and are not sent out again.

Example (with a 100-sample dataset):
prodigy textcat.manual my_dataset data/my_dataset.jsonl --label='MyLable'
The annotator proceeds to annotate 50 examples.
An issue is discovered that requires re-annotating the data.
prodigy drop my_dataset – this should clear out the old data
prodigy textcat.manual my_dataset data/my_dataset.jsonl --label='MyLable'
The total reads as 0 in the Prodigy UI; however, the annotator is only able to annotate 50 samples, not the full 100.
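
A quick way to see what the drop actually leaves behind is Prodigy's Python database API – a sketch, with my_dataset standing in for the dataset name above:

from prodigy.components.db import connect

db = connect()  # same database settings the Prodigy server uses
print(db.datasets)   # regular dataset names in the database
print(db.sessions)   # session dataset names
print(len(db.get_dataset("my_dataset")))  # examples still stored for the dataset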

Hi @cgreco,

I posted information about a fix for the next release to address duplicates while using force_stream_order: Refresh browser fix with force_stream_order - #11 by justindujardin

I think this is a separate issue related to the drop command. Can you confirm that this happens when you are using named sessions, but not if you exclude the ?session=something argument? :thinking:

@cgreco this sounds like the same thing I reported here: Dropping dataset from code doesn't properly delete examples

@justindujardin is this something you plan to address in the next release?

Hello @justindujardin! Thanks for your work debugging the issue for the next release! I look forward to it coming out.

Regarding the second issue with the drop command: this does indeed appear to be the same issue that @geniki reported. I can confirm that it occurs when using named sessions and goes away when I remove the ?session=something argument.

Thanks for your work!

@cgreco thanks for the confirmation – I think I have a workaround for you. The trouble is that when you use named sessions, Prodigy creates entries in the source dataset and in a separate "session" dataset of the form "[dataset_name]-[session_id]". So if you used a dataset "foo" and a session "bar", you would end up with examples in both "foo" and "foo-bar". When you drop "foo", it leaves "foo-bar" behind. In your case, you can work around the problem by dropping the named session datasets in addition to the regular one:

prodigy drop my-dataset
prodigy drop my-dataset-justin
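
If you have several annotators, you can do the same from Python – a sketch using Prodigy's database API (adjust the dataset name to yours; depending on the version, session datasets may be listed under db.datasets or db.sessions, so this checks both):

from prodigy.components.db import connect

db = connect()
dataset = "my-dataset"  # the name passed to the recipe
# Session datasets follow the "[dataset]-[session]" naming pattern
leftovers = [n for n in set(db.datasets + db.sessions)
             if n.startswith(dataset + "-")]
for name in [dataset] + leftovers:
    if name in db:  # the Database object supports "in" checks by name
        db.drop_dataset(name)
        print(f"Dropped {name}")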

Gotcha, thanks for the workaround. Is this the intended behavior or something that may change in the future? If it is intended, it may be useful to make a note somewhere so that people know they have to clear out the session datasets as well.

Thanks!

Just released v1.9.10, which should fix the underlying problem with force_stream_order (explained in detail by @justindujardin in this post). The only case where a glitch may still be possible with the current implementation is if you hold down a hotkey and rapid-fire through examples – but that should be a pretty unusual scenario.

Good point, I'll add a note to the docs for now! We definitely want to have a clearer link between regular datasets and session datasets in the future so Prodigy can automatically remove "orphaned" session datasets – but this may require some changes to the database model.
