How to edit existing texts that were added to a dataset using db-in

mihaivinaga · February 1, 2020, 2:48pm

I have been using dataturks.com in the past and now I am trying out prodigy.

I have converted my dataturks json file to the format that prodigy uses and I have added it into a dataset using the following command:

prodigy db-in dataset_name output.jsonl --rehash --overwrite

Now, let's say I want to search for a particular text within this dataset and change something. Is that something I can do? The main question I'm asking is: can I modify annotations after I've added them to the dataset using db-in? A search feature would be cool too.

Off topic, you guys have built an amazing tool, we've been using it for all our NER tasks.

Regards
Mihai Vinaga

ines · February 2, 2020, 1:11pm

Hey, and thanks!

Datasets in Prodigy are append-only by design – I've written some more about that concept on this thread:

changing annotations in DB via the interface

Datasets in Prodigy are append-only by design: you typically don't want to overwrite existing records, because that means you'd lose a datapoint you've collected. And it'd also make it too easy to erase work. Instead, you can re-annotate and correct the data, and save the results to a new dataset. If you make a mistake, you still have the previous data and can start again.

Prodigy's input and output formats are the same – so you can always export a dataset and load the data back in. For example, if you load a manually-annotated NER dataset back into ner.manual, the entities will be pre-highlighted and you can correct them.

If it's possible to automate some of the changes, that's great, too – for instance, if you removed label X from your label scheme, you can iterate over the "spans" and remove all entries that contain "label": "X" before you send them out for correction again.
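As a minimal sketch of that kind of automated change, assuming the exported examples are plain dictionaries with an optional "spans" list (the helper name remove_label is hypothetical and not part of Prodigy):

```python
import copy


def remove_label(examples, label):
    """Return copies of the examples without spans that carry `label`."""
    cleaned = []
    for eg in examples:
        eg = copy.deepcopy(eg)  # leave the original records untouched
        if "spans" in eg:
            eg["spans"] = [s for s in eg["spans"] if s.get("label") != label]
        cleaned.append(eg)
    return cleaned


# Made-up sample data in the same shape as exported NER annotations
examples = [
    {
        "text": "Apple hired Tim",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 12, "end": 15, "label": "X"},
        ],
    },
    {"text": "No entities here"},
]
cleaned = remove_label(examples, "X")
```

The cleaned examples could then be written to a JSONL file and loaded into ner.manual for correction, as described above.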

If you have conflicting annotations that you want to resolve to one final "master corpus", you can also use the review recipe. It takes one or more datasets with one or more sessions and will group annotations on the same input together. So if annotator A has labelled a span and annotator B hasn't, you can see both and decide what the correct answer is (or even label something entirely different by hand).

Prodigy gives you direct access to the datasets via its Python API – so you can use that to implement any filtering or search logic you need, over any fields in the JSON records. You could do a simple keyword search over the "text" values, or do something more complex with regular expressions (or even spaCy if you want a more advanced NLP-powered search).

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("dataset_name")
for eg in examples:
    # do something here, e.g. inspect the text
    print(eg["text"])

examples here is a list of dictionaries representing the individual examples. If you've found examples you want to edit, you could either export them to a file and re-annotate them (if you want to change entity spans or more complex stuff), or edit them in your script, and then save the result (previously correct examples, edited examples) to a new dataset.
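A rough sketch of what such a search could look like, using only the standard library. The find_examples helper and the sample records are made up for illustration; in practice the list would come from db.get_dataset("dataset_name").

```python
import re


def find_examples(examples, pattern):
    """Return (index, example) pairs whose "text" matches the regex `pattern`."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return [
        (i, eg)
        for i, eg in enumerate(examples)
        if regex.search(eg.get("text", ""))
    ]


# Stand-in for the list returned by the database
examples = [
    {"text": "Invoice sent to Acme Corp"},
    {"text": "Meeting with Mihai on Monday"},
    {"text": "ACME shipment delayed"},
]
matches = find_examples(examples, r"\bacme\b")
for i, eg in matches:
    print(i, eg["text"])
```

The indices make it easy to separate the matched examples from the rest, so the edited ones and the untouched ones can both be saved to a new dataset (for instance by writing them to a JSONL file and importing it with db-in under a new dataset name).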

mihaivinaga · February 2, 2020, 3:55pm

Thank you Ines for the response. I can't imagine how you guys are able to develop such an incredible library (spaCy) and tool (Prodigy) and still be able to answer every "Tom, Dick and Harry".

Let's say I've extracted 10 documents from my dataset1 and placed them in a file called file_for_dataset2. I will then use the prodigy ner.manual dataset2 ... file_for_dataset2 command to create a new dataset called dataset2 and correct everything in it.

After I am done, can I run something like prodigy review dataset1 dataset2 to merge the changes from dataset2 into dataset1?

ines · February 3, 2020, 11:38am

Thanks, we try our best! It's actually very nice to be close to the developers who are using our tools.

The review recipe only really makes sense if you want to go over all existing annotations, merge them, resolve conflicts and create a "master annotation". For instance, if you have multiple annotators working on the same data with many disagreements.

If you know that your annotations are correct, you could just divide your data in two: the 10 documents you want to re-annotate, and the rest. Then re-annotate the 10 documents, and import the rest (other documents that don't need to be changed) to the same dataset. That should be much quicker.
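One way to sketch that split in Python, assuming the examples have already been exported as a list of dictionaries. The helper names, the needs_fixing check and the sample texts are all hypothetical:

```python
import json


def split_examples(examples, needs_fixing):
    """Split examples into (to_fix, keep) using the `needs_fixing` predicate."""
    to_fix, keep = [], []
    for eg in examples:
        (to_fix if needs_fixing(eg) else keep).append(eg)
    return to_fix, keep


def to_jsonl(examples):
    """Serialize examples as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(eg) for eg in examples)


# Made-up data: the texts we know need re-annotating
texts_to_fix = {"Second document"}
examples = [
    {"text": "First document", "spans": []},
    {"text": "Second document", "spans": []},
    {"text": "Third document", "spans": []},
]
to_fix, keep = split_examples(examples, lambda eg: eg["text"] in texts_to_fix)
fix_jsonl = to_jsonl(to_fix)    # content for the file to re-annotate
keep_jsonl = to_jsonl(keep)     # content for the file to import unchanged
```

The first string would be saved as the file passed to ner.manual with the new dataset name, and the second as the file imported into that same dataset with db-in.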
