Hello,
I want to run BERT-NER with PyTorch, which requires IOB tagging. Is there any code for converting the Prodigy annotation format to IOB?
Yes, check out the section here for examples of how to create IOB and BILUO tags from character offsets: https://prodi.gy/docs/named-entity-recognition#tip-offsets-biluo
The easiest way is to use spaCy, which lets you add entities as character offsets and then gives you easy access to each token's ent_iob_ and ent_type_ attributes. Just make sure that the tokenization matches the one you used during annotation, and use nlp.pipe for faster conversion. So the whole end-to-end script that reads your Prodigy dataset and outputs IOB could look like this:
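import spacy
from prodigy.components.db import connect

db = connect()
prodigy_annotations = db.get_dataset("your_ner_dataset")
# Pair each text with its annotation so nlp.pipe passes it through as context
examples = ((eg["text"], eg) for eg in prodigy_annotations)
nlp = spacy.blank("en")  # tokenization must match the one used during annotation
for doc, eg in nlp.pipe(examples, as_tuples=True):
    # char_span returns None if the offsets don't align with token boundaries
    spans = [doc.char_span(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
    doc.ents = [span for span in spans if span is not None]
    # Tokens outside entities have an empty ent_type_, whether ent_iob_ is "O" or unset
    iob_tags = [f"{t.ent_iob_}-{t.ent_type_}" if t.ent_type_ else "O" for t in doc]
    print(doc.text, iob_tags)  # do something here...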
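If it helps to see what the offset-to-IOB step does under the hood, here is a minimal spaCy-free sketch. The helper names (tokenize_with_offsets, spans_to_iob, to_conll) are hypothetical and not part of Prodigy or spaCy, and the whitespace tokenizer is only an assumption for illustration: it will not match spaCy's tokenization (for example around punctuation), so for real data the script above is the safer route. The "token tag" per-line output is also an assumed layout, so check what your BERT-NER implementation actually expects.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (token, start, end) triples (illustrative only)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_iob(tokens, spans):
    """Assign an IOB tag to each token from character-offset spans.

    Spans that do not line up with token boundaries are skipped, similar to
    dropping spans for which doc.char_span returns None.
    """
    tags = ["O"] * len(tokens)
    starts = {start: i for i, (_, start, _) in enumerate(tokens)}
    ends = {end: i for i, (_, _, end) in enumerate(tokens)}
    for span in spans:
        first = starts.get(span["start"])
        last = ends.get(span["end"])
        if first is None or last is None or last < first:
            continue  # misaligned span, skip it
        tags[first] = f"B-{span['label']}"
        for i in range(first + 1, last + 1):
            tags[i] = f"I-{span['label']}"
    return tags


def to_conll(tokens, tags):
    """Render one token and its tag per line (assumed layout, not a fixed standard)."""
    return "\n".join(f"{tok} {tag}" for (tok, _, _), tag in zip(tokens, tags)) + "\n"


# A made-up example in the same shape as a Prodigy NER annotation
example = {
    "text": "Apple hired Tim Cook in California",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 20, "label": "PERSON"},
        {"start": 24, "end": 34, "label": "GPE"},
    ],
}

tokens = tokenize_with_offsets(example["text"])
tags = spans_to_iob(tokens, example["spans"])
print(to_conll(tokens, tags))
```

Each document would typically be separated by a blank line when writing several of them to one file.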
Thanks a lot.