I'm trying to use the review recipe to look through and correct past annotations for a dataset. This works fine if I'm using a destination dataset that is separate from my source, but I want to save the review results back to the source. I tried running review with the same source and destination datasets, as shown below:
Hi! I'm not sure what your end goal is, but in general, you should always use a different dataset to save your final reviewed corpus – otherwise, you end up with duplicate and inconsistent data. Datasets in Prodigy are append-only by design so you never lose any data points (because overwriting your annotations by accident would be bad).
The review recipe will create a final copy of the examples with the versions it was based on and your final decision. So you typically want to have that in a separate dataset that you can then train from, not mixed in with your original annotations. (If you really want to, you can always remove your original annotations later – although I'm not sure that's really necessary.)
If you ended up with a single dataset containing mixed annotations of different types, the easiest solution would be to export the data using db-out, remove the lines added by the review, and re-upload the data with db-in. You can then start again with a separate review dataset.
Hi!
I got stuck at the "remove the lines added by the review and re-upload the data" step: I couldn't find the lines added by the review in the data structure. What do these review lines look like?
Thanks for your question and sorry about the delay to get back to you!
Probably the easiest way to identify the "new lines" (i.e. the annotations) added by the review is to filter by "_session_id".
The session ID is assigned when a user opens the app and requests a new batch of questions from the server. By default, it is a timestamp. Alternatively, you can append ?session=my_review to the app URL, which sets the _session_id to the name of the dataset you're saving to plus my_review (e.g., ner_data-my_review).
Also, instead of exporting the data with db-out, you could use the database components directly: pull your annotations, filter the records for that session, create a new dataset (my_review_dataset), and add those records to it.
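If you're not sure which session ID the review annotations were saved under, one option is to count the distinct values first. This is only a rough sketch: the records below are made-up stand-ins for exported examples, and count_session_ids is a hypothetical helper, not part of Prodigy.

```python
from collections import Counter


def count_session_ids(examples):
    """Count how many examples were saved under each _session_id."""
    return Counter(eg.get("_session_id") for eg in examples)


# Made-up records shaped like annotated examples (assumption: real
# examples carry many more keys, only _session_id matters here)
examples = [
    {"text": "first", "_session_id": "2021-05-01_10-00-00"},
    {"text": "second", "_session_id": "2021-05-01_10-00-00"},
    {"text": "first", "_session_id": "ner_data-my_review"},
]

counts = count_session_ids(examples)
for session_id, n in counts.most_common():
    print(session_id, n)
```

The session that only appeared after you ran review is the one to filter on.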
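from prodigy.components.db import connect

# Connect using the default database settings
db = connect()
# Load every example saved to the mixed dataset
examples = db.get_dataset("ner_data")
# Keep only the examples saved during the my_review session
my_review = [eg for eg in examples if eg.get("_session_id") == "ner_data-my_review"]
# Create the new dataset and add the filtered examples to it
db.add_dataset("my_review_dataset", session=True)
db.add_examples(my_review, datasets=["my_review_dataset"])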
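If you'd rather go the db-out route, the same filter can be applied to the exported JSONL. The sketch below uses only the standard library and in-memory lines so it runs on its own; split_by_session and the sample records are hypothetical, and in practice the lines would come from the file written by db-out.

```python
import json


def split_by_session(lines, review_session_id):
    """Split JSONL lines into (original, review) lists of dicts."""
    original, review = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        eg = json.loads(line)
        if eg.get("_session_id") == review_session_id:
            review.append(eg)
        else:
            original.append(eg)
    return original, review


# Made-up stand-in for the contents of an exported .jsonl file
exported_lines = [
    '{"text": "first", "answer": "accept", "_session_id": "2021-05-01_10-00-00"}',
    '{"text": "second", "answer": "reject", "_session_id": "2021-05-01_10-00-00"}',
    '{"text": "first", "answer": "accept", "_session_id": "ner_data-my_review"}',
]

original, review = split_by_session(exported_lines, "ner_data-my_review")

# Serialize each group back to JSONL, one example per line
original_jsonl = "\n".join(json.dumps(eg) for eg in original)
review_jsonl = "\n".join(json.dumps(eg) for eg in review)
```

Each group can then be written to its own file and loaded with db-in, so the original annotations and the review decisions end up in separate datasets.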