Is there an elegant way to merge segmented annotations into a single document? Some annotators on my team prefer to use the '-U' flag to annotate a document as a whole, while others prefer to work on segmented text. For some downstream tasks I need to harmonise all annotated documents and have them unsegmented.
Also, how can one merge segmented pieces that came from the same document? I tried to introduce a meta parameter, doc_id, to keep track of the document, but it failed.
Specifically, I am interested in the ner.correct recipe, with the raw data for annotation in the form:
data = [{"text": "Some text goes here", "meta": {"doc_id": 12345}}]
Unless I missed this in the Prodigy manual, do you have any suggestions for managing the annotated data effectively? Thanks!
This sounds like a solid plan. How exactly did it fail?
Any "meta" values or other custom properties on the annotation tasks should be preserved when sentences are split. For instance, your input file in JSONL format could look like this:
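If the doc_id does come through on every segment, the merging step itself can be done outside Prodigy on the exported examples. The following is only a minimal sketch under several assumptions, not a built-in Prodigy feature: `merge_segments` is a hypothetical helper, the examples are plain dicts (for instance, lines loaded from a `prodigy db-out` export), the segments of each document appear in their original order, and a single space is a good enough stand-in for whatever whitespace separated the sentences in the source text.

```python
from collections import defaultdict


def merge_segments(examples, sep=" "):
    """Merge segmented tasks that share meta["doc_id"] into one task per document.

    Hypothetical helper: assumes segments arrive in document order and that
    `sep` approximates the whitespace that originally separated them.
    """
    grouped = defaultdict(list)
    for eg in examples:
        grouped[eg["meta"]["doc_id"]].append(eg)

    merged = []
    for doc_id, segments in grouped.items():
        text = ""
        spans = []
        for i, seg in enumerate(segments):
            if i > 0:
                text += sep
            # Character position where this segment starts in the merged text
            offset = len(text)
            for span in seg.get("spans", []):
                shifted = dict(span)
                shifted["start"] = span["start"] + offset
                shifted["end"] = span["end"] + offset
                # Token indices are relative to the segment and would be stale
                # after merging, so this sketch drops them.
                shifted.pop("token_start", None)
                shifted.pop("token_end", None)
                spans.append(shifted)
            text += seg["text"]
        merged.append({"text": text, "spans": spans, "meta": {"doc_id": doc_id}})
    return merged


# Made-up example data: two sentence-level annotations from the same document
segments = [
    {
        "text": "Apple hired Tim Cook.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 12, "end": 20, "label": "PERSON"},
        ],
        "meta": {"doc_id": 12345},
    },
    {
        "text": "He lives in California.",
        "spans": [{"start": 12, "end": 22, "label": "GPE"}],
        "meta": {"doc_id": 12345},
    },
]

docs = merge_segments(segments)
for span in docs[0]["spans"]:
    print(span["label"], docs[0]["text"][span["start"]:span["end"]])
```

The sketch ignores the "answer" field, so rejected or ignored segments would need to be filtered out beforehand, and the merged character offsets only line up with the original document if the separator matches the original spacing.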