I have been reading about re-annotating records and have not had much luck finding answers.
I have a base NER model that I created using ner.manual. As new records come in, I apply this model to them and get classification results. For the ones that are not classified properly, I annotate them again, add them to the dataset and re-train the same model to get updated classification results.
This works well for me.
But in some cases, I would like to re-annotate records that have already been annotated, and then re-train the model to update it.
Since these records are already in the database, I will not be able to annotate them again.
Can you suggest a solution to complete this process?
Hi! Prodigy datasets are append-only, so you never overwrite or lose any datapoints. So if you want to re-annotate a dataset, you typically want to save the annotations to a new dataset. If you're using a new dataset, the existing annotated examples also won't be excluded automatically.
There are different ways to go about reannotating, depending on your use case:
Export the data and load it back in. Prodigy's input and output formats are the same, so you can always use existing annotated data as the input text and annotations will be pre-highlighted in the UI so you can correct them if needed.
Use the review recipe. This is helpful if you're dealing with duplicate and potentially conflicting annotations – for instance annotations created by multiple annotators. You'll then be able to see all different versions and create one "master annotation" with the final answer.
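To make the first option a bit more concrete, here is a minimal sketch using only the standard library. The example record and the file name are made up for illustration; in practice the records would come from your existing dataset (for instance via the db-out command), and the resulting file would be used as the input source for a recipe like ner.manual together with a new dataset name.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(examples, path):
    """Write one JSON object per line (the JSONL format)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical, hand-written record standing in for an exported annotation.
examples = [
    {
        "text": "Apple hired Jane Doe.",
        "spans": [{"start": 0, "end": 5, "label": "ORG"}],
        "answer": "accept",
    }
]

# Hypothetical file name; any writable location works.
out_path = Path(tempfile.gettempdir()) / "reannotate_input.jsonl"
write_jsonl(examples, out_path)

# The spans survive the round trip, which is why they can be shown
# pre-highlighted when the file is loaded as input again.
reloaded = read_jsonl(out_path)
```

Because the file keeps the "spans" of each record, the existing annotations are available for correction instead of having to be redone from scratch.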
Much appreciated. I can try those to see if it works for my use case.
Also, is there a way I can reference the database to check whether records have already been annotated?
For example, say there is an incoming data file with 200 rows, but 50 of them are already annotated. I want to create a pie chart with the number of records annotated and the number of records yet to be annotated.
How can I refer to the database to get this done? Or is there any other way to do this?
Yes, the easiest way is to just do it in Python and use the hashes! You can read more about the hashing logic here. The Database has the methods get_input_hashes and get_task_hashes that are more efficient and just give you the hashes.
So if you want to find whether the input text has already been annotated, you could do something like this:
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL
from prodigy import set_hashes

db = connect()
input_hashes = db.get_input_hashes("your_dataset")
data = JSONL("/path/to/your/new/file.jsonl")  # or whatever

already_annotated = 0
to_be_annotated = 0
for eg in data:
    eg = set_hashes(eg, overwrite=True)
    if eg["_input_hash"] in input_hashes:
        already_annotated += 1
    else:
        to_be_annotated += 1
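For the pie chart itself, the two counters are all you need. The following is only a sketch: the function names, labels and output file name are made up, and the plotting part assumes matplotlib is installed (it is imported lazily, so the percentage helper works without it). The numbers are the ones from the question, 50 of 200 rows already annotated.

```python
def annotation_shares(already_annotated, to_be_annotated):
    """Return the percentage of annotated and not-yet-annotated records."""
    total = already_annotated + to_be_annotated
    if total == 0:
        # Avoid dividing by zero for an empty input file.
        return {"annotated": 0.0, "not yet annotated": 0.0}
    return {
        "annotated": 100.0 * already_annotated / total,
        "not yet annotated": 100.0 * to_be_annotated / total,
    }


def plot_shares(already_annotated, to_be_annotated,
                out_path="annotation_progress.png"):
    """Save a pie chart of the two counts (assumes matplotlib is installed)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.pie(
        [already_annotated, to_be_annotated],
        labels=["annotated", "not yet annotated"],
        autopct="%1.1f%%",
    )
    ax.set_title("Annotation progress")
    fig.savefig(out_path)
    plt.close(fig)


# Counts from the example in the question: 50 of 200 rows already annotated.
shares = annotation_shares(50, 150)
```

Calling plot_shares(already_annotated, to_be_annotated) after the loop above would then write the chart to a PNG file.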