How to convert IOB/BILOU format into jsonl

I have an annotated dataset (for NER) in IOB format. Is there a straightforward method to convert it to JSONL format?

IOB file

Alex B-PER
is O
going O
with O
Marty B-PER
Rick I-PER
to O
Los B-LOC
Angeles I-LOC
. O

Alex B-PER
birthday O
is O
April B-DATE
1996 I-DATE
. O

Required output file

{"text":"Alex is going with Marty Rick to Los Angeles.","_input_hash":-154867549,"_task_hash":-848148549,"tokens":[{"text":"Alex","start":0,"end":4,"id":0,"ws":true},{"text":"is","start":5,"end":7,"id":1,"ws":true},{"text":"going","start":8,"end":13,"id":2,"ws":true},{"text":"with","start":14,"end":18,"id":3,"ws":true},{"text":"Marty","start":19,"end":24,"id":4,"ws":true},{"text":"Rick","start":25,"end":29,"id":5,"ws":true},{"text":"to","start":30,"end":32,"id":6,"ws":true},{"text":"Los","start":33,"end":36,"id":7,"ws":true},{"text":"Angeles","start":37,"end":44,"id":8,"ws":false},{"text":".","start":44,"end":45,"id":9,"ws":false}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":4,"token_start":0,"token_end":0,"label":"PER"},{"start":19,"end":29,"token_start":4,"token_end":5,"label":"PER"},{"start":33,"end":44,"token_start":7,"token_end":8,"label":"LOC"}],"answer":"accept"}
{"text":"Alex birthday is April 1996.","_input_hash":-1936698937,"_task_hash":-1676226294,"tokens":[{"text":"Alex","start":0,"end":4,"id":0,"ws":true},{"text":"birthday","start":5,"end":13,"id":1,"ws":true},{"text":"is","start":14,"end":16,"id":2,"ws":true},{"text":"April","start":17,"end":22,"id":3,"ws":true},{"text":"1996","start":23,"end":27,"id":4,"ws":false},{"text":".","start":27,"end":28,"id":5,"ws":false}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":4,"token_start":0,"token_end":0,"label":"PER"},{"start":17,"end":27,"token_start":3,"token_end":4,"label":"DATE"}],"answer":"accept"}
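For a quick standalone conversion, here's a minimal pure-Python sketch. It assumes `token TAG` lines separated into sentence blocks by blank lines, joins tokens with single spaces (with a naive rule for attaching punctuation), and leaves out the `_input_hash`/`_task_hash` fields; the function names are just for illustration.

```python
import json

def iob_block_to_task(lines):
    """Convert one sentence block of 'token TAG' lines into a
    Prodigy-style task dict with text, tokens, and spans."""
    words, tags = zip(*(line.rsplit(" ", 1) for line in lines))
    tokens, spans = [], []
    text = ""
    for i, (word, tag) in enumerate(zip(words, tags)):
        start = len(text)
        text += word
        end = len(text)
        # naive rule: no space before sentence-internal punctuation or at the end
        ws = i + 1 < len(words) and words[i + 1] not in {".", ",", "!", "?"}
        if ws:
            text += " "
        tokens.append({"text": word, "start": start, "end": end, "id": i, "ws": ws})
        if tag.startswith("B-"):
            spans.append({"start": start, "end": end,
                          "token_start": i, "token_end": i, "label": tag[2:]})
        elif tag.startswith("I-") and spans:
            # extend the currently open entity span
            spans[-1]["end"] = end
            spans[-1]["token_end"] = i
    return {"text": text, "tokens": tokens, "spans": spans, "answer": "accept"}

def iob_text_to_jsonl(iob_text):
    """Split the file contents on blank lines and emit one JSON line per sentence."""
    blocks = iob_text.strip().split("\n\n")
    return "\n".join(json.dumps(iob_block_to_task(b.strip().split("\n")))
                     for b in blocks)
```

Note that this reconstructs the text from the tokens, so the whitespace may not match the original sentence exactly; if you still have the original text, the spaCy-based approach below is more reliable.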

Hi! Prodigy's format uses simple character offsets into the text. If you still have the original tokenization and only the IOB or BILUO tags, you can use spaCy's offsets_from_biluo_tags helper function to convert the token-based tags to character offsets. See here for an example: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets

biluo_tags_to_offsets works only with BILUO tags, not with IOB tags.

My NER data is in French, in this BIO-tag format:
words: ['Le', '11', 'octobre', '2018', ',', 'Monsieur', 'B****', 'se', 'presente', 'a', 'la', 'consultation', 'avec', 'un', 'tableau', 'de', 'luxation', 'posterieure', 'de', 'genou', 'non', 'deficitaire', '.']
tags: ['O', 'B-DATE', 'I-DATE', 'I-DATE', 'O', 'O', 'B-PATIENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
and I need my model to highlight suggestions for me.

Hi, if you use nlp() to tokenize, the tokenization might not be aligned with your original annotation.

This should work to create a doc from words + IOB tags:

from spacy.tokens import Doc

doc = Doc(nlp.vocab, words=words, ents=tags)

And then you can access the entities with token and character offsets through doc.ents.
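If you want to sanity-check what those character offsets look like without spaCy, here's a pure-Python sketch of the same computation, assuming tokens are joined by single spaces (which is what Doc does by default when no spaces are given); the function name is just for illustration:

```python
def iob_to_char_offsets(words, tags):
    """Compute (start, end, label) character offsets from words + IOB tags,
    joining tokens with single spaces."""
    offsets = []
    current = None  # the currently open entity as (start, end, label)
    pos = 0
    for word, tag in zip(words, tags):
        begin, end = pos, pos + len(word)
        pos = end + 1  # +1 for the joining space
        if tag.startswith("B-"):
            if current:
                offsets.append(current)
            current = (begin, end, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[2]:
            current = (current[0], end, current[2])  # extend the open entity
        else:  # "O" (or a malformed I- tag): close any open entity
            if current:
                offsets.append(current)
            current = None
    if current:
        offsets.append(current)
    return offsets
```

For the French example above, this yields (3, 18, 'DATE') for "11 octobre 2018" and the PATIENT span for "B****".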


Hi Adriane, thank you for your feedback.
How can I get this expected format using my model?


My model's prediction results have this format, for example:
['O', 'O', 'B-PATIENT', 'I-PATIENT', 'O', 'O', 'B-DATE']

so I'm trying to convert IOB tags to character offsets,
but in your documentation I only find the BILUO conversion.

Ah, it looks like that example didn't get updated for spaCy v3. We'll get the docs updated soon, but here's the updated example:

from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.training import biluo_tags_to_offsets

doc = Doc(Vocab(), words=["I", "like", "New", "York"])
tags = ["O", "O", "B-LOC", "L-LOC"]
offsets = biluo_tags_to_offsets(doc, tags)  # [(7, 15, 'LOC')]

If you have IOB tags instead of BILUO tags, you need to convert them first with:

from spacy.training import iob_to_biluo

tags = iob_to_biluo(tags)
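For reference, the IOB-to-BILUO rewrite is simple enough to sketch in plain Python. This is only an illustration of what the helper does (use spaCy's iob_to_biluo in practice); it assumes well-formed IOB, i.e. no I- tag without a preceding B-/I- of the same label:

```python
def iob_to_biluo_sketch(tags):
    """Rewrite IOB tags as BILUO: B-X becomes U-X and I-X becomes L-X
    whenever the entity ends at that token."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # does the same entity continue on the next token?
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "L-") + label)
    return out
```

Applied to the model output above, ['O', 'O', 'B-PATIENT', 'I-PATIENT', 'O', 'O', 'B-DATE'] becomes ['O', 'O', 'B-PATIENT', 'L-PATIENT', 'O', 'O', 'U-DATE'].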