Loading pre-annotated data that has multiple sub-labels per word

Hello,

I have pre-annotated data in which some words need multiple labels in a hierarchical format. For example:

Text: "I took tylenol."

Tylenol - Label: Medication
Tylenol - Sub-label: Polar
Tylenol - Sub-label: Generic
etc..

Currently, the format I use to load this with a single label is:

{
'text': 'I took tylenol.',
'tokens': etc.. ,
'spans':[{'start':7,'end':14,'token_start':2,'token_end':2,'label':'Medication'}]
}

Loading this format as JSONL with prodigy mark highlights Tylenol as the medication, which is a great first step. How can I edit the format to include multiple sub-labels on the same word?

Hi!

Traditionally, NER annotation in Prodigy allows only one label per token.

However, for Prodigy 1.11, we've created a new recipe, spans.manual, that lets you annotate overlapping and nested spans. Your input would look something like this (newlines added for readability; they wouldn't be in your JSONL file):

{"text":"I took tylenol.",

"tokens":[{"text":"I","start":0,"end":1,"id":0,"ws":true},
{"text":"took","start":2,"end":6,"id":1,"ws":true},
{"text":"tylenol","start":7,"end":14,"id":2,"ws":false},
{"text":".","start":14,"end":15,"id":3,"ws":false}],

"spans":[{"start":7,"end":14,"token_start":2,"token_end":2,"label":"Medication"},
{"start":7,"end":14,"token_start":2,"token_end":2,"label":"Generic"}]}

And then with

prodigy spans.manual my_output blank:en input.jsonl -l Medication,Generic

those spans would be preannotated:

[image]
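If your existing annotations are stored as a word plus a list of labels, a small script can generate input lines in the format shown above. This is only a rough sketch: `tokenize` and `make_task` are hypothetical helper names, and the regex tokenizer is a stand-in assumption. In practice the tokens should come from the same tokenizer the recipe uses (blank:en here), otherwise the token boundaries may not line up.

```python
import json
import re


def tokenize(text):
    """Stand-in tokenizer (an assumption): words and single punctuation marks."""
    tokens = []
    for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        end = m.end()
        # "ws" records whether the token is followed by whitespace
        ws = end < len(text) and text[end].isspace()
        tokens.append(
            {"text": m.group(), "start": m.start(), "end": end, "id": i, "ws": ws}
        )
    return tokens


def make_task(text, annotations):
    """Build one task dict; annotations is a list of (phrase, [labels]) pairs.

    Every label of a phrase becomes its own span over the same characters
    and tokens. Only the first occurrence of each phrase is annotated.
    """
    tokens = tokenize(text)
    spans = []
    for phrase, labels in annotations:
        start = text.find(phrase)
        if start == -1:
            continue  # phrase not found in the text
        end = start + len(phrase)
        covered = [t["id"] for t in tokens if t["start"] >= start and t["end"] <= end]
        if not covered:
            continue  # phrase does not align with token boundaries
        for label in labels:
            spans.append(
                {
                    "start": start,
                    "end": end,
                    "token_start": covered[0],
                    "token_end": covered[-1],
                    "label": label,
                }
            )
    return {"text": text, "tokens": tokens, "spans": spans}


task = make_task("I took tylenol.", [("tylenol", ["Medication", "Polar", "Generic"])])
line = json.dumps(task)  # one JSON object per line in the JSONL file
print(line)
```

Each label used in the spans (here Medication, Polar and Generic) would presumably also need to be listed in the -l argument of the command.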

For more information on the upcoming 1.11 release, currently available as a "nightly" release, see this thread: ✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans, improved feeds & more
