Convert annotated NER data to entity "offset format"

LBoss · August 24, 2020, 4:25pm

Dear Prodigy team,

I annotated data for NER and I want to follow the example for training from the spaCy website which can be found here:

Guides -> Training models -> NER -> Updating the Named Entity Recognizer

The required input format for the trainset is:

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

which is used later in nlp.update.
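For readers unfamiliar with this format: each entity tuple holds character offsets into the text (start inclusive, end exclusive) plus a label. A minimal sketch to check that the offsets line up with the intended text, where `entity_texts` is a hypothetical helper name and not part of spaCy:

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_texts(example):
    """Return the (substring, label) pairs that the character offsets select."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for example in TRAIN_DATA:
    # Slicing the text by each (start, end) pair should give the entity string.
    print(entity_texts(example))
```

Running a check like this on converted data is a cheap way to catch offsets that are shifted by a character or counted in tokens instead of characters.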

My question is: how can I get my annotations into the format described above ("offset format"), which the example requires? I used the data-to-spacy recipe, but it seems to produce a different format, one that looks like it is meant for command-line training.

Thanks for your help!

ines · August 24, 2020, 6:06pm

Hi! I think the solution might be easier than you think. When you export your annotations with db-out, each annotated example will contain a list of "spans", and each span has a "start" and "end". Those are the entity offsets. (You can see an example of the JSON format here.)
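As a rough illustration of that conversion, here is a minimal sketch that reads db-out style JSONL records and builds the offset tuples. It assumes each record has a "text" key, a "spans" list whose items carry "start", "end" and "label", and an "answer" key, and that only accepted examples should be kept. The function name `prodigy_to_offsets` and the sample records are made up for the example.

```python
import json


def prodigy_to_offsets(lines, only_accepted=True):
    """Convert JSONL lines to (text, {"entities": [(start, end, label), ...]}) tuples."""
    train_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        # Assumption: rejected or ignored examples should not be used for training.
        if only_accepted and eg.get("answer") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in eg.get("spans", [])
        ]
        train_data.append((eg["text"], {"entities": entities}))
    return train_data


# Two hypothetical records, shaped like lines of an exported JSONL file.
sample = [
    '{"text": "Who is Shaka Khan?", "spans": [{"start": 7, "end": 17, "label": "PERSON"}], "answer": "accept"}',
    '{"text": "I like London.", "spans": [{"start": 7, "end": 13, "label": "LOC"}], "answer": "reject"}',
]
TRAIN_DATA = prodigy_to_offsets(sample)
print(TRAIN_DATA)
```

To run it on a real export, pass an open file instead of the sample list, for example `with open("annotations.jsonl", encoding="utf8") as f: TRAIN_DATA = prodigy_to_offsets(f)`, where `annotations.jsonl` is a placeholder for wherever the db-out output was saved.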

The data-to-spacy command produces JSON-formatted training data in spaCy's format that you can use with the spacy train CLI command.

LBoss · August 25, 2020, 9:01am

Thanks, I extracted them from the jsonl and got the example I wanted to try running.
