Annotated Data output formatting

Shhariar096 · February 12, 2019, 10:18am

I am new to Prodigy, and this might be a silly question.
I want to annotate CVs in PDF/doc/docx format with a few labels. Since Prodigy does not support these formats directly, I converted the CV data into a JSONL file and ran prodigy ner.manual dataset_name en_core_web_sm data.jsonl --label label1,label2,label3 to start the Prodigy app on localhost.
After annotating, I saved the output and retrieved it with db-out.

Here is the output file structure:
"text": "sample text....",
"_input_hash": 000000000,
"_task_hash": 0000000000,
"tokens": [{ "text": "abc", "start": 0, "end": 2, "id": 0 },....],
"spans": [{ "start": 0, "end": 16, "token_start": 0, "token_end": 1, "label": "xyz" }...],
"answer": "accept"

Now I want each span's content in this format:

{"label":["label_name"],"points":[{"start":00,"end":50,"text":"abc xyz"}]}

Is there a way to map the content of multiple tokens to my label text using the start and end positions in the text? I mention multiple tokens because Prodigy treats space-separated strings as separate tokens.
Suppose the sentence is "Albert Einstein was a theoretical physicist." and my annotator selects "Albert Einstein" as Name. I would then want output like this:

{"label":["name"],"points":[{"start":00,"end":14,"text":"Albert Einstein"}]}

Here, Albert and Einstein are 2 tokens with token_id 1 & 2.

ines · February 12, 2019, 11:57am

Hi! I hope I understand your question correctly – but what you describe is pretty much exactly what’s stored in the "spans" of your annotated data?

"start" and "end" are the character offsets into the text, "token_start" and "token_end" are the indices of the start and end tokens (corresponding with the tokens in "tokens") and "label" is the label assigned to that particular span.

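As a rough sketch of how the "spans" could be mapped onto the format asked about above, the snippet below slices the text with each span's character offsets. The function name spans_to_points and the inclusive_end flag are hypothetical, not part of Prodigy. It assumes that "end" is an exclusive character offset, so that text[start:end] yields the span text. The requested format appears to use an inclusive end (14 for "Albert Einstein"), hence the optional adjustment.

```python
import json


def spans_to_points(example, inclusive_end=False):
    """Convert one annotated example into a list of label/points records.

    Assumption: span["end"] is an exclusive character offset, so that
    example["text"][start:end] is the annotated text.
    """
    text = example["text"]
    records = []
    for span in example.get("spans", []):
        start, end = span["start"], span["end"]
        records.append({
            "label": [span["label"]],
            "points": [{
                "start": start,
                # Hypothetical option: report the index of the last character
                # instead of the exclusive end offset.
                "end": end - 1 if inclusive_end else end,
                "text": text[start:end],
            }],
        })
    return records


# Hand-written example in the shape of a db-out record (not real output).
example = {
    "text": "Albert Einstein was a theoretical physicist.",
    "spans": [
        {"start": 0, "end": 15, "token_start": 0, "token_end": 1, "label": "name"}
    ],
    "answer": "accept",
}

for record in spans_to_points(example, inclusive_end=True):
    print(json.dumps(record))
```

To process a whole export, the same function could be applied to each line of the JSONL file after json.loads, skipping records whose "answer" is not "accept".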