Loading pre-annotated data that has multiple sub-labels per word

Hi!

Traditionally, NER annotation in Prodigy allows only one label per token.

However, for Prodigy 1.11, we've created a new recipe, spans.manual, that lets you annotate overlapping and nested spans. Your input would look something like this (newlines added for readability; in your JSONL file each record sits on a single line):

{"text":"I took tylenol.",

"tokens":[{"text":"I","start":0,"end":1,"id":0,"ws":true},
{"text":"took","start":2,"end":6,"id":1,"ws":true},
{"text":"tylenol","start":7,"end":14,"id":2,"ws":false},
{"text":".","start":14,"end":15,"id":3,"ws":false}],

"spans":[{"start":7,"end":14,"token_start":2,"token_end":2,"label":"Medication"},
{"start":7,"end":14,"token_start":2,"token_end":2,"label":"Generic"}]}

And then with

prodigy spans.manual my_output blank:en input.jsonl -l Medication,Generic

those spans would be pre-annotated:

[image]
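If your existing annotations are stored as token indices plus labels, a record in the format above could be generated with a small script. This is only a rough sketch: `build_task` is a hypothetical helper (not part of Prodigy), and it assumes you already have the token strings in order and that each one appears verbatim in the text. The token boundaries should match what the blank:en tokenizer would produce, so it's worth checking a few examples by hand.

```python
import json


def build_task(text, words, labeled_tokens):
    """Build one task dict with "text", "tokens" and "spans" keys.

    words: token strings in the order they appear in `text`.
    labeled_tokens: (token_start, token_end, label) triples, where
        token_end is inclusive. The same token range may appear more
        than once with different labels.
    """
    tokens = []
    cursor = 0
    for i, word in enumerate(words):
        # Search from the end of the previous token; raises ValueError
        # if the word cannot be found in the remaining text.
        start = text.index(word, cursor)
        end = start + len(word)
        tokens.append({
            "text": word,
            "start": start,
            "end": end,
            "id": i,
            # True if the token is directly followed by whitespace
            "ws": end < len(text) and text[end].isspace(),
        })
        cursor = end

    spans = []
    for token_start, token_end, label in labeled_tokens:
        spans.append({
            "start": tokens[token_start]["start"],
            "end": tokens[token_end]["end"],
            "token_start": token_start,
            "token_end": token_end,
            "label": label,
        })
    return {"text": text, "tokens": tokens, "spans": spans}


task = build_task(
    "I took tylenol.",
    ["I", "took", "tylenol", "."],
    [(2, 2, "Medication"), (2, 2, "Generic")],
)
# One JSON object per line is what goes into input.jsonl
line = json.dumps(task)
print(line)
```

Writing one such line per example to input.jsonl gives you a file you can pass to the command above.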

For more information on the upcoming 1.11 release, currently available as a "nightly" release, see this thread: ✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans, improved feeds & more
