Merging segmented examples (Prodigy ner.correct) and keeping track of documents

Andrey · February 16, 2020, 1:48pm

Hi all,

Is there an elegant way to merge segmented annotations into a single document? Some annotators on my team prefer to use the '-U' flag to annotate a document as a whole, while others prefer to work on segmented text. For some downstream tasks I need to harmonise all annotated documents and have them unsegmented.

Also, how can one merge segmented pieces that came from the same document? I tried to introduce a meta parameter doc_id to keep track of the document, but it failed.

Specifically, I am interested in the ner.correct recipe, with the raw data for annotation in the form:

data = [{"text": "Some text goes here", "meta": {"doc_id": 12345}}]

Unless I missed this bit in the Prodigy manual, do you have any suggestions on managing the annotated data effectively? Thanks!

ines · February 16, 2020, 3:37pm

This sounds like a solid plan. How exactly did it fail?

Any "meta" values or other custom properties on the annotation tasks should be preserved when sentences are split. For instance, your input file in JSONL format could look like this:

{"text": "Sentence one. Sentence two.", "meta": {"doc_id": 12345}}

If sentence splitting is enabled, this will become two questions, both with the same "meta".

Andrey · February 16, 2020, 8:25pm

Hi @ines,

Thanks for your quick answer. I double-checked my code and found a mistake: I didn't use a dictionary for meta, but rather just a number:

{"text": "Sentence one. Sentence two.", "doc_id": 12345}

which is of course incorrect. I've fixed it now and it works like a charm!
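
For anyone landing here with the same merging question, here is a minimal sketch in plain Python of how the segments could be stitched back together once every task carries "meta": {"doc_id": ...}. It is not a built-in Prodigy feature. The function name merge_segments is made up for illustration, and the sketch assumes that the examples arrive in their original document order, that the original whitespace between sentences is not stored (so segments are joined with a single space), and that only the "start", "end" and "label" keys of each span are needed.

```python
from collections import defaultdict


def merge_segments(examples, sep=" "):
    """Merge tasks that share meta["doc_id"] into one task per document.

    Hypothetical helper: assumes `examples` is in original document order
    and that segments can be re-joined with `sep`.
    """
    grouped = defaultdict(list)
    for eg in examples:
        grouped[eg["meta"]["doc_id"]].append(eg)

    merged = []
    for doc_id, segments in grouped.items():
        parts, spans, offset = [], [], 0
        for seg in segments:
            for span in seg.get("spans", []):
                # Shift character offsets by the length of the text so far.
                spans.append({
                    "start": span["start"] + offset,
                    "end": span["end"] + offset,
                    "label": span["label"],
                })
            parts.append(seg["text"])
            offset += len(seg["text"]) + len(sep)
        merged.append({
            "text": sep.join(parts),
            "spans": spans,
            "meta": {"doc_id": doc_id},
        })
    return merged


# Toy data: two segments from document 1, one unsegmented document 2.
examples = [
    {"text": "Anna lives in Paris.", "meta": {"doc_id": 1},
     "spans": [{"start": 0, "end": 4, "label": "PERSON"},
               {"start": 14, "end": 19, "label": "GPE"}]},
    {"text": "She likes Berlin.", "meta": {"doc_id": 1},
     "spans": [{"start": 10, "end": 16, "label": "GPE"}]},
    {"text": "Unsegmented doc.", "meta": {"doc_id": 2}, "spans": []},
]

merged = merge_segments(examples)
for doc in merged:
    entities = [(doc["text"][s["start"]:s["end"]], s["label"]) for s in doc["spans"]]
    print(doc["meta"]["doc_id"], entities)
```

The input could be the JSONL written by prodigy db-out, read line by line with json.loads. Token-level keys such as "token_start" are dropped here, and rejected or ignored answers are not filtered, since skipping a segment would leave a hole in the merged text; whether that is acceptable depends on the downstream task.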
