Merging segmented examples (Prodigy ner.correct) and keeping track of documents

Hi all,

Is there an elegant way to merge segmented annotations into a single document? Some annotators on my team prefer to use the '-U' flag to annotate each document as a whole, while others prefer segmented examples. For some downstream tasks I need to harmonise all annotated documents and have them unsegmented.

Also, how can one merge segmented pieces that came from the same document? I tried to introduce a meta parameter, doc_id, to keep track of the document, but it failed.

Specifically, I am interested in the ner.correct recipe, with the raw data for annotation in the form:

data = [{"text": "Some text goes here", "meta": {"doc_id": 12345}}]

Unless I missed this bit in the Prodigy manual, are there any suggestions for managing the annotated data effectively? Thanks!

This sounds like a solid plan. How exactly did it fail?

Any "meta" values or other custom properties on the annotation tasks should be preserved when sentences are split. For instance, your input file in JSONL format could look like this:

{"text": "Sentence one. Sentence two.", "meta": {"doc_id": 12345}}

If sentence splitting is enabled, this will become two questions, both with the same "meta".
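
Once the segments carry the same "doc_id", they can be stitched back together after export. The following is only a minimal sketch in plain Python, not a built-in Prodigy feature. The helper name merge_segments is hypothetical. It assumes that each exported example has a "text", an optional "spans" list with character offsets relative to that segment, and a "meta" dict with "doc_id", and that the segments appear in document order. It also assumes that joining segments with a single space is acceptable, so the merged text may differ from the original in whitespace.

```python
from collections import defaultdict


def merge_segments(examples, sep=" "):
    """Merge segmented examples that share meta["doc_id"] (hypothetical helper)."""
    # Group segments by document, keeping the order in which they were read.
    grouped = defaultdict(list)
    for eg in examples:
        grouped[eg["meta"]["doc_id"]].append(eg)

    merged = []
    for doc_id, segments in grouped.items():
        texts, spans, offset = [], [], 0
        for seg in segments:
            for span in seg.get("spans", []):
                shifted = dict(span)
                # Shift character offsets from segment-relative to document-relative.
                shifted["start"] = span["start"] + offset
                shifted["end"] = span["end"] + offset
                # Token indices were segment-relative, so drop them here.
                shifted.pop("token_start", None)
                shifted.pop("token_end", None)
                spans.append(shifted)
            texts.append(seg["text"])
            offset += len(seg["text"]) + len(sep)
        merged.append(
            {"text": sep.join(texts), "spans": spans, "meta": {"doc_id": doc_id}}
        )
    return merged


# Two segments from the same document, with made-up spans for illustration.
segments = [
    {"text": "Sentence one.", "meta": {"doc_id": 12345},
     "spans": [{"start": 9, "end": 12, "label": "NUM"}]},
    {"text": "Sentence two.", "meta": {"doc_id": 12345},
     "spans": [{"start": 9, "end": 12, "label": "NUM"}]},
]
docs = merge_segments(segments)
```

The sketch does not look at the "answer" field, so rejected or ignored examples would need to be filtered out first if they should not end up in the merged document.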

Hi @ines,

Thanks for your quick answer. I double-checked my code and found a mistake: I didn't use a dictionary for meta, just a bare number:

{"text": "Sentence one. Sentence two.", "doc_id": 12345}

which is of course incorrect. I've fixed it now and it works like a charm!
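
For anyone with existing data in the same shape, the fix can be applied in bulk. This is a small sketch in plain Python; the helper name move_doc_id_to_meta is made up for illustration, and it assumes each record is a dict loaded from one JSONL line.

```python
import json


def move_doc_id_to_meta(record, key="doc_id"):
    """Return a copy of record with a top-level key moved into "meta"."""
    fixed = dict(record)
    if key in fixed:
        # Copy any existing meta so the input record is left untouched.
        meta = dict(fixed.get("meta", {}))
        meta[key] = fixed.pop(key)
        fixed["meta"] = meta
    return fixed


line = '{"text": "Sentence one. Sentence two.", "doc_id": 12345}'
fixed = move_doc_id_to_meta(json.loads(line))
print(json.dumps(fixed))
```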
