Hello!
I was wondering if I can load my own dataset (text and labels, in the IOB scheme at the moment) into Prodigy, and whether the labeled entities will be highlighted.
The text was labeled with predictions from a PyTorch model, and I want to load it into Prodigy to see the labels already highlighted, check that everything is labeled correctly, and correct it where it isn't.
Is this possible? At the moment I have two files: one with the text (one sentence per line), and one with only the labels, e.g. (O O O B-MTRL I-MTRL O O), per line. Do I need to convert this into dictionaries of text, tokens and spans before loading it with db-in?
Hi!
I think the best option for you would be to preprocess the already labeled texts that you have, convert them into Prodigy's format, and then use ner.manual to correct them.
The target output format you want to obtain is something like the following. I added newlines for readability here, but in the JSONL file each text example should be on a single line, without any line breaks:
{"text":"We are visiting London and Berlin tomorrow",
"tokens":[{"text":"We","start":0,"end":2,"id":0},
{"text":"are","start":3,"end":6,"id":1},
{"text":"visiting","start":7,"end":15,"id":2},
{"text":"London","start":16,"end":22,"id":3},
{"text":"and","start":23,"end":26,"id":4},
{"text":"Berlin","start":27,"end":33,"id":5},
{"text":"tomorrow","start":34,"end":42,"id":6}],
"spans": [{"start":16,"end":22,"label":"CITY"},
{"start":27,"end":33,"label":"CITY"}]}
Then, if you run
prodigy ner.manual output_db blank:en input_annotated_texts.jsonl -l CITY
You'd get the text with the existing CITY spans already highlighted in the annotation interface.
You can then either hit "accept" if it's all good, or correct the annotations/labels first.
To get to the required JSONL format from your IOB annotations, you can use spaCy for the conversion, e.g.:
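from spacy.lang.en import English
from spacy.tokens import Doc

vocab = English().vocab
# ents takes one IOB tag per word; spaces marks whether each word is followed by a space
doc = Doc(vocab, words=["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"], spaces=[True, True, True, True, True, True, False], ents=["O", "O", "O", "B-CITY", "O", "B-CITY", "O"])
for ent in doc.ents:
    print(ent.start_char, ent.end_char, ent.label_)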
This will give you
16 22 CITY
27 33 CITY
Or have a look at some of spaCy's built-in utility functions, e.g. biluo_tags_to_spans: https://spacy.io/api/top-level#biluo_tags_to_spans
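If you'd rather see the whole conversion from your two files in one place, here is a minimal plain-Python sketch. It makes a few assumptions that may not hold for your data: tokens in the text file are separated by whitespace, each label line has exactly one IOB tag per token (optionally wrapped in parentheses), and an I- tag that doesn't continue a span with the same label is treated as the start of a new span. The function names iob_to_prodigy and convert_files and the file paths are just illustrative, not part of Prodigy or spaCy.

```python
import json


def iob_to_prodigy(text, tags):
    """Build a dict with "text", "tokens" and "spans" from one sentence and its IOB tags.

    Assumes whitespace tokenization and one tag per token.
    """
    words = text.split()
    if len(words) != len(tags):
        raise ValueError(f"{len(words)} tokens but {len(tags)} tags for: {text!r}")
    tokens, spans = [], []
    current = None  # the span currently being extended, if any
    offset = 0
    for i, (word, tag) in enumerate(zip(words, tags)):
        # Character offsets of this token in the original text
        start = text.index(word, offset)
        end = start + len(word)
        offset = end
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        if tag.startswith("I-") and current is not None and current["label"] == tag[2:]:
            current["end"] = end  # extend the open span over this token
            continue
        current = None  # "O", "B-" or a stray "I-" closes any open span
        if tag != "O":
            current = {"start": start, "end": end, "label": tag[2:]}
            spans.append(current)
    return {"text": text, "tokens": tokens, "spans": spans}


def convert_files(text_path, labels_path, out_path):
    """Write one JSON object per line (JSONL), pairing the two files line by line."""
    with open(text_path, encoding="utf8") as texts, \
            open(labels_path, encoding="utf8") as labels, \
            open(out_path, "w", encoding="utf8") as out:
        for text, tag_line in zip(texts, labels):
            # Drop optional surrounding parentheses from the label line
            tags = tag_line.strip().strip("()").split()
            out.write(json.dumps(iob_to_prodigy(text.strip(), tags)) + "\n")
```

Calling convert_files("texts.txt", "labels.txt", "input_annotated_texts.jsonl") (hypothetical file names) would then produce a file you can pass to ner.manual as in the command above. If your model used a different tokenizer than plain whitespace splitting, you'd need to adapt the token offsets accordingly.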
Hope that helps you get started in the right direction!
Thanks a lot for the detailed answer!