NER Prodigy to IOB2 format

snu-ceyda · August 4, 2020, 8:23am

Using data-to-spacy with a --base-model gives alignment warnings and complains that there is no sentencizer (I don't want to use one), and it splits the text into sentences anyway...

How can I go from Prodigy to CoNLL (IOB2/BIO) format, not BILUO?

ines · August 4, 2020, 1:35pm

If you want to export your data for use with spaCy, spaCy's training format always expects sentences under the hood. You can see an example of this here: https://spacy.io/api/annotation#json-input

If your annotations don't align to the base model's tokenization and you don't need annotations in spaCy's training format, I'm not sure data-to-spacy is a good fit here? If you've collected annotations with a manual workflow like ner.manual or ner.correct, the data you can export with db-out gives you the "tokens" and the character offsets of each annotated span, as well as its start and end token IDs. So the first token of each span is B, the remaining tokens in the span are I, and every token outside a span is O.
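As a rough sketch of that mapping, something like the following should work on a single exported record. It assumes each record has a "tokens" list with "text" fields and a "spans" list with "token_start", "token_end" and "label", and that "token_end" is inclusive. The function name `prodigy_to_iob2` and the sample record are made up for illustration, so check them against your own db-out export.

```python
def prodigy_to_iob2(example):
    """Return (token_text, IOB2_tag) pairs for one annotated record."""
    tokens = example["tokens"]
    tags = ["O"] * len(tokens)  # every token starts out as O
    for span in example.get("spans", []):
        start = span["token_start"]
        end = span["token_end"]  # assumed to be inclusive
        label = span["label"]
        tags[start] = f"B-{label}"  # first token of the span
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{label}"  # remaining tokens of the span
    return [(tok["text"], tag) for tok, tag in zip(tokens, tags)]


# Hypothetical record in the shape described above
example = {
    "text": "Apple hired Tim Cook",
    "tokens": [
        {"text": "Apple", "start": 0, "end": 5, "id": 0},
        {"text": "hired", "start": 6, "end": 11, "id": 1},
        {"text": "Tim", "start": 12, "end": 15, "id": 2},
        {"text": "Cook", "start": 16, "end": 20, "id": 3},
    ],
    "spans": [
        {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "ORG"},
        {"start": 12, "end": 20, "token_start": 2, "token_end": 3, "label": "PERSON"},
    ],
}

pairs = prodigy_to_iob2(example)
# One "token tag" pair per line, CoNLL-style
conll = "\n".join(f"{text} {tag}" for text, tag in pairs)
```

To export a whole dataset you would run this over each line of the JSONL file written by db-out and put a blank line between records. Records with an "answer" other than "accept" probably need to be filtered out first.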

If you only have character offsets, spaCy can also generate IOB tags out-of-the-box: https://prodi.gy/docs/named-entity-recognition#tip-offsets-biluo
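Going by its anchor, the linked tip produces BILUO tags rather than IOB2, so one more step may be needed if you go that route. Here is a minimal sketch of that step in plain Python, assuming you already have a list of BILUO tag strings per document. The helper name `biluo_to_iob2` is hypothetical and not part of spaCy or Prodigy.

```python
def biluo_to_iob2(tags):
    """Convert a list of BILUO tags to IOB2 tags."""
    converted = []
    for tag in tags:
        if tag.startswith("L-"):
            # last token of a multi-token entity becomes I
            converted.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            # single-token entity becomes B
            converted.append("B-" + tag[2:])
        else:
            # B-, I-, O and anything else pass through unchanged
            converted.append(tag)
    return converted


biluo = ["U-ORG", "O", "B-PERSON", "I-PERSON", "L-PERSON"]
iob2 = biluo_to_iob2(biluo)
```

Any tag the function doesn't recognise, such as a placeholder for a misaligned token, is left as-is, so you can decide separately how to handle those.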
