Convert annotated NER data to entity "offset format"

Dear Prodigy team,

I annotated data for NER and I want to follow the training example from the spaCy website, which can be found here:

Guides -> Training models -> NER -> Updating the Named Entity Recognizer

The required input format for the training set is:

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

which is used later in nlp.update.

My question is: how can I get my annotations into the format described above (the "offset format"), which the example requires? I used the data-to-spacy recipe, but it seems to produce something else, which looks like it is meant for command-line training.

Thanks for your help!

Hi! I think the solution might be easier than you think. When you export your annotations with db-out, each annotated example will contain a list of "spans", and each span has a "start" and "end". Those are the entity offsets. (You can see an example of the JSON format here.)

The data-to-spacy command produces JSON-formatted training data in spaCy's format that you can use with the spacy train CLI command.
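To make the db-out route concrete, here is a minimal sketch of the conversion, not an official recipe. It assumes each exported line is a JSON object with "text", "spans" (each with "start", "end" and "label") and "answer" keys, and that you only want accepted examples. The function name prodigy_to_offsets and the inline sample lines are made up for illustration; in practice you would pass in the lines of your exported JSONL file instead.

```python
import json


def prodigy_to_offsets(lines):
    """Convert db-out JSONL lines into spaCy's (text, {"entities": [...]}) tuples.

    Hypothetical helper: assumes each line has "text", "spans" and "answer" keys.
    """
    train_data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        # Assumption: keep only accepted annotations; examples without an
        # "answer" key are treated as accepted.
        if example.get("answer", "accept") != "accept":
            continue
        entities = [
            (span["start"], span["end"], span["label"])
            for span in example.get("spans", [])
        ]
        train_data.append((example["text"], {"entities": entities}))
    return train_data


# Inline stand-in for an exported file. With a real export you could use
# open("annotations.jsonl", encoding="utf8") instead (hypothetical file name).
sample_lines = [
    '{"text": "Who is Shaka Khan?", "spans": [{"start": 7, "end": 17, "label": "PERSON"}], "answer": "accept"}',
    '{"text": "I like London and Berlin.", "spans": [{"start": 7, "end": 13, "label": "LOC"}, {"start": 18, "end": 24, "label": "LOC"}], "answer": "accept"}',
    '{"text": "Not an entity here.", "spans": [], "answer": "reject"}',
]

TRAIN_DATA = prodigy_to_offsets(sample_lines)
print(TRAIN_DATA)
```

The resulting TRAIN_DATA list has the same shape as the one in the spaCy example, so it should be usable in the nlp.update loop from that guide. Whether to drop rejected or ignored examples depends on how you annotated, so adjust the filter to fit your data.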

Thanks, I extracted the offsets from the JSONL and got the example I wanted to try running.
