Hi! We have several old datasets annotated by a couple of annotators. We also have a new dataset that includes some (already annotated) tasks from the old datasets. We want to annotate the new dataset with the same annotators, but exclude the tasks that were already annotated. The exclusion should take the annotator into account. For example, if annotator A completed task T in an old dataset, task T should not be presented to annotator A during annotation of the new dataset. But if annotator B did not complete task T in the old dataset, task T should still be offered to annotator B.
So we need some extended "--exclude" flag functionality that skips completed tasks per annotator. What is the best way to achieve this in Prodigy?
Thank you!
I think for custom logic like that, it's probably best to write your own filter function that uses the hashes and the annotator information to decide whether to send out an example. You probably want to start one Prodigy instance per annotator and maybe a custom recipe that lets you pass in the annotator name or the names of the datasets they've annotated. You can then only send out the examples that haven't been annotated yet. For example, something like this:
from prodigy.components.db import connect
from prodigy import set_hashes

def get_custom_annotator_stream(stream, old_annotator_dataset):
    db = connect()
    # Get the input hashes of the examples annotated by that annotator
    # (alternatively, you could also load all data here and filter it
    # by _session_id or any other way)
    input_hashes = db.get_input_hashes([old_annotator_dataset])
    for eg in stream:
        # Make sure the incoming example has hashes to compare against
        eg = set_hashes(eg)
        if eg["_input_hash"] not in input_hashes:
            # Only send out the example if it has not been annotated yet
            yield eg
Whether you use the "_task_hash" or the "_input_hash" depends on the type of data and whether you want to consider examples with the same text/image/etc. the same, or whether you want to distinguish between different questions on the same data. See the docs on hashing for details.
- Instead of starting one Prodigy instance per annotator, is it possible to create a new unified dataset that includes the annotated tasks from the old datasets plus the new tasks (the dataset name changes from the old names to the new name, but the annotator names stay the same), so that Prodigy works out which tasks to show or skip per annotator? That is what I tried to do here.
- What is Prodigy's default behavior when annotator A accesses Prodigy via ?session=A after Prodigy has been stopped and started a second time? Will Prodigy know which tasks to skip for annotator A because A already completed them in the first run? This is actually what I want to achieve in the first question above.
We have an initial dataset. To get its content we run prodigy db-out test_dec24_merged3 test_dec24_merged3_A. Here is the content of the initial dataset:
{"text":"aaa 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ","meta":{"source":"www.example.com"},"_input_hash":164129426,"_task_hash":682402973,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":["L1"],"answer":"accept"}
{"text":"aaa 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ","meta":{"source":"www.example.com"},"_input_hash":1144725719,"_task_hash":1637647646,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":["L1"],"answer":"accept"}
{"text":"aaa 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ","meta":{"source":"www.example.com"},"_input_hash":164129426,"_task_hash":682402973,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
{"text":"aaa 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ","meta":{"source":"www.example.com"},"_input_hash":1144725719,"_task_hash":1637647646,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
{"text":"aaa 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 ","meta":{"source":"www.example.com"},"_input_hash":15576074,"_task_hash":-932427445,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
{"text":"aaa 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ","meta":{"source":"www.example.com"},"_input_hash":-382853746,"_task_hash":379641888,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
We are interested in the tasks from the annotator called s1:
{"text":"aaa 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ","meta":{"source":"www.example.com"},"_input_hash":164129426,"_task_hash":682402973,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":["L1"],"answer":"accept"}
{"text":"aaa 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ","meta":{"source":"www.example.com"},"_input_hash":1144725719,"_task_hash":1637647646,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":["L1"],"answer":"accept"}
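For anyone who wants to pull out one annotator's rows programmatically rather than by eye, here is a minimal sketch in plain Python. It assumes the export is JSONL as shown above and that "_session_id" has the form "<dataset>-<session>", which is what the rows above look like. The function name `records_for_session` and the abbreviated sample rows are made up for illustration.

```python
import json

def records_for_session(lines, session_name):
    """Yield exported records whose "_session_id" ends with "-<session_name>"."""
    suffix = "-" + session_name
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        if (record.get("_session_id") or "").endswith(suffix):
            yield record

# Abbreviated stand-ins for two rows of the export above
sample = [
    '{"text": "aaa 6", "_input_hash": 164129426, "_session_id": "test_dec24_merged3-s1"}',
    '{"text": "aaa 8", "_input_hash": 15576074, "_session_id": "test_dec24_merged3-s2"}',
]
s1_records = list(records_for_session(sample, "s1"))
print([r["text"] for r in s1_records])
```

With a real export you would pass an open file handle instead of the `sample` list.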
Then we execute Prodigy by command:
prodigy textcat.manual test_dec24_merged3 united_input_h10.jsonl --label "L1","L2"
Prodigy was accessed from the browser as annotator s1 via the URL http://0.0.0.0:8080/?session=s1
Here is the content of the input file 'united_input_h10.jsonl':
{"text": "aaa 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 ", "meta": {"source": "www.example.com"}}
Here is the content of our dataset after annotation, exported with the command prodigy db-out test_dec24_merged3 test_dec24_merged3_B:
{"text":"aaa 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ","meta":{"source":"www.example.com"},"_input_hash":164129426,"_task_hash":682402973,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":["L1"],"answer":"accept"}
{"text":"aaa 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ","meta":{"source":"www.example.com"},"_input_hash":1144725719,"_task_hash":1637647646,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":["L1"],"answer":"accept"}
{"text":"aaa 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 ","meta":{"source":"www.example.com"},"_input_hash":164129426,"_task_hash":682402973,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
{"text":"aaa 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ","meta":{"source":"www.example.com"},"_input_hash":1144725719,"_task_hash":1637647646,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
{"text":"aaa 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 ","meta":{"source":"www.example.com"},"_input_hash":15576074,"_task_hash":-932427445,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
{"text":"aaa 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ","meta":{"source":"www.example.com"},"_input_hash":-382853746,"_task_hash":379641888,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s2","_view_id":"choice","accept":["L2"],"answer":"accept"}
{"text":"aaa 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ","meta":{"source":"www.example.com"},"_input_hash":1820239773,"_task_hash":-2103791298,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":[],"answer":"accept"}
{"text":"aaa 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ","meta":{"source":"www.example.com"},"_input_hash":1084935059,"_task_hash":-1917208248,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":[],"answer":"accept"}
{"text":"aaa 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 ","meta":{"source":"www.example.com"},"_input_hash":-2050076717,"_task_hash":-48850187,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":[],"answer":"accept"}
{"text":"aaa 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 ","meta":{"source":"www.example.com"},"_input_hash":-550827711,"_task_hash":1636659230,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":[],"answer":"accept"}
{"text":"aaa 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 ","meta":{"source":"www.example.com"},"_input_hash":-525040965,"_task_hash":1039894319,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":[],"answer":"accept"}
{"text":"aaa 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 ","meta":{"source":"www.example.com"},"_input_hash":-530563605,"_task_hash":752319968,"options":[{"id":"L1","text":"L1"},{"id":"L2","text":"L2"}],"_session_id":"test_dec24_merged3-s1","_view_id":"choice","accept":[],"answer":"accept"}
The problem is that the following annotation tasks (from the input file 'united_input_h10.jsonl') were skipped by Prodigy and never displayed to the s1 annotator:
{"text": "aaa 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 ", "meta": {"source": "www.example.com"}}
{"text": "aaa 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ", "meta": {"source": "www.example.com"}}
They are missing from the dataset after the annotation pass performed by s1.
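The same comparison can be scripted. Below is a small sketch, not anything built into Prodigy, that lists the input tasks a given session never answered. It assumes that matching on the "text" field is good enough for this data and that "_session_id" ends with "-<session>"; `unanswered_texts` is a hypothetical helper, and the short "task N" strings stand in for the longer texts above.

```python
def unanswered_texts(input_tasks, annotations, session_name):
    """Return the texts from input_tasks that session_name has no annotation for."""
    suffix = "-" + session_name
    done = {
        a["text"]
        for a in annotations
        if (a.get("_session_id") or "").endswith(suffix)
    }
    return [t["text"] for t in input_tasks if t["text"] not in done]

# Stand-in data mirroring the experiment: 10 input tasks,
# s1 answered 6, 7, 1, 2, 3, 4, 5, 10 and s2 answered 6, 7, 8, 9
input_tasks = [{"text": f"task {n}"} for n in range(1, 11)]
annotations = [
    {"text": f"task {n}", "_session_id": "test_dec24_merged3-s1"}
    for n in (6, 7, 1, 2, 3, 4, 5, 10)
] + [
    {"text": f"task {n}", "_session_id": "test_dec24_merged3-s2"}
    for n in (6, 7, 8, 9)
]

missing = unanswered_texts(input_tasks, annotations, "s1")
print(missing)  # ['task 8', 'task 9']
```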
Thank you!
What are your settings? Are you setting "feed_overlap" to true or false, and are you setting "force_stream_order"? And when you ran your experiment, did you verify that all datasets were newly created and empty?
Hi, here is the content of prodigy.json:
{ "db": "mysql", "db_settings": { "mysql": { "host": ", "user": "", "passwd": "", "db": "prodigy" } }, "batch_size": 1, "host": "0.0.0.0", "show_stats": true, "show_flag": false, "instructions": "/home/yuri/.prodigy/instructions.html", "custom_theme": {"cardMaxWidth": 2000}, "largeText": 3, "mediumText": 3, "smallText": 3, "javascript":"prodigy.addEventListener('prodigyanswer', event => {const selected = event.detail.task.accept || []; if (!selected.length) {alert('Task with no selected options submitted.')}})",
"force_stream_order": false
}
I think that my datasets were empty and newly created.
Here is a description of a related problem: Setting "feed_overlap": true loses its effect after a Prodigy restart
Hi,
We have two annotators, s1 and s2, the setting "feed_overlap": true, and a dataset with tasks numbered 1 to 10.
s1 annotated the first 3 tasks, then s2 annotated the first 5 tasks, then Prodigy was restarted.
When annotator s1 accessed Prodigy's web GUI again, she was presented with tasks 6, 7, 8, 9, 10. So in the end tasks 4 and 5 were missed for annotator s1 (probably because these tasks had been annotated by s2). Is this the expected behavior?
Here is a schematic description of the annotation flow, where numbers represent tasks and a plus sign (+) represents an answer from an annotator. The timeline runs top to bottom.
s1|s2
--|--
+1+
+2+
+3+
4+
5+
---> Prodigy stopped and started
+6
+7
+8
+9
+10
Thank you!
I've merged your threads into one, because it's easier to discuss this jointly.
I think it all comes down to one detail about how the session identifier is and isn't used. The session ID you set in named multi-user sessions is used as an identifier for the annotations created. It's not currently used in the exclude logic. So when you start the server, annotations that are already in the dataset are filtered out, regardless of who created them, and that stream is then used to distribute questions to annotators.
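To make that concrete, here is a toy model of the restart scenario from the second post. It is only a simplified illustration of the behavior described above, not Prodigy's actual implementation, and the function name `stream_after_restart` is made up.

```python
def stream_after_restart(tasks, saved_annotations):
    """Toy model: drop every task that is already in the dataset,
    no matter which session answered it."""
    already_in_dataset = {task for task, _session in saved_annotations}
    return [t for t in tasks if t not in already_in_dataset]

tasks = list(range(1, 11))
# Before the restart: s1 answered tasks 1-3, s2 answered tasks 1-5
saved = [(t, "s1") for t in (1, 2, 3)] + [(t, "s2") for t in (1, 2, 3, 4, 5)]

remaining = stream_after_restart(tasks, saved)
print(remaining)  # [6, 7, 8, 9, 10]

# Tasks s1 never answered but will no longer be shown
s1_done = {t for t, session in saved if session == "s1"}
missed_for_s1 = [t for t in tasks if t not in s1_done and t not in remaining]
print(missed_for_s1)  # [4, 5]
```

Because the exclusion looks only at the dataset and not at the session, tasks 4 and 5 drop out of the stream for everyone, which matches what s1 saw after the restart.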
I think we can probably put together some code for a custom filter that you can plug in that lets you implement feed filtering based on previous annotations and session IDs. But it might still be much more straightforward and easier to reason about to just start separate, isolated instances for each annotator.
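As a starting point, a session-aware filter could look roughly like the sketch below. It is plain Python over already-exported annotations and makes a few assumptions: "_session_id" has the form "<dataset>-<session>" as in the exports above, session names contain no hyphen, and the incoming examples already carry hashes (for instance via set_hashes, as in the earlier snippet). `hashes_by_session` and `filter_for_session` are hypothetical helper names, not part of Prodigy.

```python
from collections import defaultdict

def hashes_by_session(annotations, hash_key="_input_hash"):
    """Map each session name to the set of hashes that session has annotated."""
    seen = defaultdict(set)
    for eg in annotations:
        session_id = eg.get("_session_id") or ""
        # Assumption: "<dataset>-<session>", session name without hyphens
        session_name = session_id.rsplit("-", 1)[-1]
        seen[session_name].add(eg[hash_key])
    return seen

def filter_for_session(stream, seen, session_name, hash_key="_input_hash"):
    """Yield only the examples this session has not annotated yet."""
    done = seen.get(session_name, set())
    for eg in stream:
        if eg[hash_key] not in done:
            yield eg

# Input hashes taken from the export earlier in the thread
annotations = [
    {"_input_hash": 164129426, "_session_id": "test_dec24_merged3-s1"},
    {"_input_hash": 1144725719, "_session_id": "test_dec24_merged3-s1"},
    {"_input_hash": 164129426, "_session_id": "test_dec24_merged3-s2"},
    {"_input_hash": 1144725719, "_session_id": "test_dec24_merged3-s2"},
    {"_input_hash": 15576074, "_session_id": "test_dec24_merged3-s2"},
    {"_input_hash": -382853746, "_session_id": "test_dec24_merged3-s2"},
]
stream = [
    {"text": "aaa 6", "_input_hash": 164129426},
    {"text": "aaa 7", "_input_hash": 1144725719},
    {"text": "aaa 8", "_input_hash": 15576074},
    {"text": "aaa 9", "_input_hash": -382853746},
]

seen = hashes_by_session(annotations)
for_s1 = [eg["text"] for eg in filter_for_session(stream, seen, "s1")]
print(for_s1)  # ['aaa 8', 'aaa 9']
```

Passing hash_key="_task_hash" instead would distinguish between different questions about the same input. With one instance per annotator, you would apply the filter for that annotator's session name before the stream is handed to the server.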