Convert Prodigy annotation file to IOB format

Hello,

I want to run BERT-NER with PyTorch, which requires IOB tags. Is there any code for converting the Prodigy annotation format to IOB?


Yes, check out the section here for examples of how to create IOB and BILUO tags from character offsets: https://prodi.gy/docs/named-entity-recognition#tip-offsets-biluo

The easiest way is to use spaCy, which lets you add entities as character offsets and then gives you easy access to each token's ent_iob_ and ent_type_ attributes. Just make sure that the tokenization matches the one you used during annotation and use nlp.pipe for faster conversion. So the whole end-to-end script that reads your Prodigy dataset and outputs IOB could look like this:

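from prodigy.components.db import connect
import spacy

db = connect()  # uses the database settings from your prodigy.json
prodigy_annotations = db.get_dataset("your_ner_dataset")
examples = ((eg["text"], eg) for eg in prodigy_annotations)
nlp = spacy.blank("en")  # must tokenize the same way as during annotation
for doc, eg in nlp.pipe(examples, as_tuples=True):
    # char_span returns None if the offsets don't align with token boundaries
    spans = [doc.char_span(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
    doc.ents = [span for span in spans if span is not None]
    # ent_type_ is empty for tokens outside an entity, so those get a plain "O"
    iob_tags = [f"{t.ent_iob_}-{t.ent_type_}" if t.ent_type_ else "O" for t in doc]
    print(doc.text, iob_tags)  # do something here...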
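For the "do something here" step, one option is to write the tokens and tags to a CoNLL-style text file, with one token and tag per line and a blank line between sentences. This is only a rough sketch: the write_conll helper, the space separator and the sample data are my own assumptions and not part of Prodigy or spaCy, so check which layout your BERT-NER training script actually expects.

```python
import io


def write_conll(sentences, handle, sep=" "):
    """Write (tokens, tags) pairs as 'token<sep>tag' lines.

    Hypothetical helper: each sentence is followed by a blank line,
    which is the layout many CoNLL-style readers expect.
    """
    for tokens, tags in sentences:
        if len(tokens) != len(tags):
            raise ValueError("tokens and tags must have the same length")
        for token, tag in zip(tokens, tags):
            # Skip whitespace-only tokens, which would break the column layout
            if not token.strip():
                continue
            handle.write(f"{token}{sep}{tag}\n")
        handle.write("\n")


# Made-up sample standing in for the tokens and IOB tags of two texts
sample = [
    (["Apple", "hired", "Tim", "Cook"], ["B-ORG", "O", "B-PERSON", "I-PERSON"]),
    (["Thanks"], ["O"]),
]

buffer = io.StringIO()  # swap in open("train.txt", "w", encoding="utf8") for a real file
write_conll(sample, buffer)
conll_text = buffer.getvalue()
print(conll_text)
```

In the loop above you would collect ([t.text for t in doc], iob_tags) for each doc and pass that list to the helper instead of the sample.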

Thanks a lot.