How to convert IOB/BILOU format into jsonl

I have an annotated dataset (for NER) in IOB format. Is there a straightforward method to convert it to JSONL format?

IOB file

Alex B-PER
is O
going O
with O
Marty B-PER
Rick I-PER
to O
Los B-LOC
Angeles I-LOC
. O

Alex B-PER
birthday O
is O
April B-DATE
1996 I-DATE
. O

Required output file

{"text":"Alex is going with Marty Rick to Los Angeles.","_input_hash":-154867549,"_task_hash":-848148549,"tokens":[{"text":"Alex","start":0,"end":4,"id":0,"ws":true},{"text":"is","start":5,"end":7,"id":1,"ws":true},{"text":"going","start":8,"end":13,"id":2,"ws":true},{"text":"with","start":14,"end":18,"id":3,"ws":true},{"text":"Marty","start":19,"end":24,"id":4,"ws":true},{"text":"Rick","start":25,"end":29,"id":5,"ws":true},{"text":"to","start":30,"end":32,"id":6,"ws":true},{"text":"Los","start":33,"end":36,"id":7,"ws":true},{"text":"Angeles","start":37,"end":44,"id":8,"ws":false},{"text":".","start":44,"end":45,"id":9,"ws":false}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":4,"token_start":0,"token_end":0,"label":"PER"},{"start":19,"end":29,"token_start":4,"token_end":5,"label":"PER"},{"start":33,"end":44,"token_start":7,"token_end":8,"label":"LOC"}],"answer":"accept"}
{"text":"Alex birthday is April 1996.","_input_hash":-1936698937,"_task_hash":-1676226294,"tokens":[{"text":"Alex","start":0,"end":4,"id":0,"ws":true},{"text":"birthday","start":5,"end":13,"id":1,"ws":true},{"text":"is","start":14,"end":16,"id":2,"ws":true},{"text":"April","start":17,"end":22,"id":3,"ws":true},{"text":"1996","start":23,"end":27,"id":4,"ws":false},{"text":".","start":27,"end":28,"id":5,"ws":false}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":4,"token_start":0,"token_end":0,"label":"PER"},{"start":17,"end":27,"token_start":3,"token_end":4,"label":"DATE"}],"answer":"accept"}
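For a quick standalone conversion, here's a minimal pure-Python sketch. It assumes `token TAG` lines separated into sentence blocks by blank lines, joins tokens with single spaces (with a naive rule for attaching punctuation), and leaves out the `_input_hash`/`_task_hash` fields; the function names are just for illustration.

```python
import json

def iob_block_to_task(lines):
    """Convert one sentence block of 'token TAG' lines into a
    Prodigy-style task dict with text, tokens, and spans."""
    words, tags = zip(*(line.rsplit(" ", 1) for line in lines))
    tokens, spans = [], []
    text = ""
    for i, (word, tag) in enumerate(zip(words, tags)):
        start = len(text)
        text += word
        end = len(text)
        # naive rule: no space before sentence-internal punctuation or at the end
        ws = i + 1 < len(words) and words[i + 1] not in {".", ",", "!", "?"}
        if ws:
            text += " "
        tokens.append({"text": word, "start": start, "end": end, "id": i, "ws": ws})
        if tag.startswith("B-"):
            spans.append({"start": start, "end": end,
                          "token_start": i, "token_end": i, "label": tag[2:]})
        elif tag.startswith("I-") and spans:
            # extend the currently open entity span
            spans[-1]["end"] = end
            spans[-1]["token_end"] = i
    return {"text": text, "tokens": tokens, "spans": spans, "answer": "accept"}

def iob_text_to_jsonl(iob_text):
    """Split the file contents on blank lines and emit one JSON line per sentence."""
    blocks = iob_text.strip().split("\n\n")
    return "\n".join(json.dumps(iob_block_to_task(b.strip().split("\n")))
                     for b in blocks)
```

Note that this reconstructs the text from the tokens, so the whitespace may not match the original sentence exactly; if you still have the original text, the spaCy-based approach below is more reliable.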

Hi! Prodigy's format uses simple character offsets into the text. If you still have the original tokenization and only the IOB or BILUO tags, you can use spaCy's offsets_from_biluo_tags helper function to convert the token-based tags to character offsets. See here for an example: https://prodi.gy/docs/named-entity-recognition#tip-biluo-offsets

biluo_tags_to_offsets works only with BILUO tags, not with IOB tags.

My NER data is in French, in this BIO-tag format:
words: ['Le', '11', 'octobre', '2018', ',', 'Monsieur', 'B****', 'se', 'presente', 'a', 'la', 'consultation', 'avec', 'un', 'tableau', 'de', 'luxation', 'posterieure', 'de', 'genou', 'non', 'deficitaire', '.']
tags: ['O', 'B-DATE', 'I-DATE', 'I-DATE', 'O', 'O', 'B-PATIENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
and I need my model to highlight suggestions for me.

Hi, if you use nlp() to tokenize, the tokenization might not be aligned with your original annotation.

This should work to create a doc from words + IOB tags:

from spacy.tokens import Doc

doc = Doc(nlp.vocab, words=words, ents=tags)

And then you can access the entities with token and character offsets through doc.ents.
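If you want to sanity-check what those character offsets look like without spaCy, here's a pure-Python sketch of the same computation, assuming tokens are joined by single spaces (which is what Doc does by default when no spaces are given); the function name is just for illustration:

```python
def iob_to_char_offsets(words, tags):
    """Compute (start, end, label) character offsets from words + IOB tags,
    joining tokens with single spaces."""
    offsets = []
    current = None  # the currently open entity as (start, end, label)
    pos = 0
    for word, tag in zip(words, tags):
        begin, end = pos, pos + len(word)
        pos = end + 1  # +1 for the joining space
        if tag.startswith("B-"):
            if current:
                offsets.append(current)
            current = (begin, end, tag[2:])
        elif tag.startswith("I-") and current and tag[2:] == current[2]:
            current = (current[0], end, current[2])  # extend the open entity
        else:  # "O" (or a malformed I- tag): close any open entity
            if current:
                offsets.append(current)
            current = None
    if current:
        offsets.append(current)
    return offsets
```

For the French example above, this yields (3, 18, 'DATE') for "11 octobre 2018" and the PATIENT span for "B****".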


Hi Adriane, thank you for your feedback.
How can I get this expected format using my model?


My model's prediction results have this format, for example:
['O', 'O', 'B-PATIENT', 'I-PATIENT', 'O', 'O', 'B-DATE']

so I'm trying to convert IOB tags to character offsets,
but in your documentation I only find the BILUO conversion.

Ah, it looks like that example didn't get updated for spaCy v3. We'll get the docs updated soon, but here's the updated example:

from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.training import biluo_tags_to_offsets

doc = Doc(Vocab(), words=["I", "like", "New", "York"])
tags = ["O", "O", "B-LOC", "L-LOC"]
offsets = biluo_tags_to_offsets(doc, tags)  # [(7, 15, 'LOC')]

If you have IOB tags instead of BILUO tags, you need to convert them first with:

from spacy.training import iob_to_biluo

tags = iob_to_biluo(tags)
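For reference, the IOB-to-BILUO rewrite is simple enough to sketch in plain Python. This is only an illustration of what the helper does (use spaCy's iob_to_biluo in practice); it assumes well-formed IOB, i.e. no I- tag without a preceding B-/I- of the same label:

```python
def iob_to_biluo_sketch(tags):
    """Rewrite IOB tags as BILUO: B-X becomes U-X and I-X becomes L-X
    whenever the entity ends at that token."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # does the same entity continue on the next token?
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "L-") + label)
    return out
```

Applied to the model output above, ['O', 'O', 'B-PATIENT', 'I-PATIENT', 'O', 'O', 'B-DATE'] becomes ['O', 'O', 'B-PATIENT', 'L-PATIENT', 'O', 'O', 'U-DATE'].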