I am new to Prodigy, so this might be a silly question.
I want to annotate CVs in PDF/doc/docx format with a few labels. As Prodigy does not support these formats directly, I converted the CV data into a JSONL file and ran prodigy ner.manual dataset_name en_core_web_sm data.jsonl --label label1,label2,label3 to start the Prodigy app on localhost.
After annotating, I saved the output and exported it with db-out.
Here is the structure of the output file:
"text": "sample text....",
"_input_hash":000000000,
"_task_hash":0000000000,
"tokens": [{ "text": "abc", "start": 0, "end": 2, "id": 0 },....],
"spans": [{ "start": 0, "end": 16, "token_start": 0, "token_end": 1, "label": "xyz" }...], "answer": "accept"
Now I want each span’s content in this format:
{"label":["label_name"],"points":[{"start":00,"end":50,"text":"abc xyz"}]}
Is there any mechanism to map the content of multiple tokens into my label's text using the span's start and end positions? I mention multiple tokens because Prodigy treats each space-separated string as a separate token.
Suppose the sentence is: Albert Einstein was a theoretical physicist. My annotator selects Albert Einstein as Name. Now I want the output to look like this:
{"label":["name"],"points":[{"start":00,"end":14,"text":"Albert Einstein"}]}
Here, Albert and Einstein are two tokens with token ids 0 and 1.
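
One way this mapping could work, as a rough sketch rather than anything built into Prodigy: since each span already carries character offsets, slicing the original text with them should return the full multi-token string, so the token ids are not needed. The helper name `spans_to_points` and the `inclusive_end` flag are made up for illustration. I am also assuming the exported `end` offset is exclusive (15 for Albert Einstein), which is why 1 is subtracted to get the 14 in the desired output.

```python
import json


def spans_to_points(task, inclusive_end=True):
    """Turn one exported task (a dict) into a list of label/points records.

    Hypothetical helper for illustration; it is not part of Prodigy.
    """
    text = task["text"]
    records = []
    for span in task.get("spans", []):
        start, end = span["start"], span["end"]
        records.append({
            "label": [span["label"]],
            "points": [{
                "start": start,
                # Assumption: the exported "end" is exclusive, so subtract 1
                # when an inclusive end offset is wanted.
                "end": end - 1 if inclusive_end else end,
                # Slicing the original text covers every token in the span.
                "text": text[start:end],
            }],
        })
    return records


# Made-up task mirroring the Albert Einstein example above.
example = {
    "text": "Albert Einstein was a theoretical physicist.",
    "spans": [
        {"start": 0, "end": 15, "token_start": 0, "token_end": 1, "label": "name"}
    ],
    "answer": "accept",
}

for record in spans_to_points(example):
    print(json.dumps(record))
```

To process a whole db-out export, the same function could be called on each line of the JSONL file after `json.loads`, optionally skipping tasks whose "answer" is not "accept".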