Re-labeling custom dataset with Prodigy

Hello!
I was wondering whether I can load my own dataset (text and labels, currently in the IOB scheme) into Prodigy so that the labeled entities are highlighted.
The text was labeled with predictions from a PyTorch model. I'd like to load it into Prodigy, see the labels already highlighted, check whether everything is labeled correctly, and correct whatever isn't.
Is that possible? At the moment I have two files: one with the text (one sentence per line) and one with only the labels, e.g. (O O O B-MTRL I-MTRL O O), one sequence per line. Do I need to convert this into dictionaries with text, tokens and spans before loading it with db-in?

Hi!

I think the best option for you would be to further preprocess the already labeled texts that you have, convert them into Prodigy format and then use ner.manual to correct them.

The target output format you want to obtain is something like the following. I added newlines for readability here, but in the actual JSONL file each text example should sit on a single line, without any line breaks inside it:

{"text":"We are visiting London and Berlin tomorrow", 

"tokens":[{"text":"We","start":0,"end":2,"id":0},
{"text":"are","start":3,"end":6,"id":1},
{"text":"visiting","start":7,"end":15,"id":2},
{"text":"London","start":16,"end":22,"id":3},
{"text":"and","start":23,"end":26,"id":4},
{"text":"Berlin","start":27,"end":33,"id":5},
{"text":"tomorrow","start":34,"end":42,"id":6}], 

"spans": [{"start":16,"end":22,"label":"CITY"},
{"start":27,"end":33,"label":"CITY"}]}
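To make the "one example per line" point concrete, here is a minimal sketch in plain Python that serialises the record above as a single JSONL line and checks that every span starts and ends on a token boundary. The helper names (`spans_align_with_tokens`, `to_jsonl_line`) are made up for illustration and are not part of Prodigy; the alignment check is only a sanity check I'd assume is useful before loading the data.

```python
import json

# The example record from above, as a Python dict.
example = {
    "text": "We are visiting London and Berlin tomorrow",
    "tokens": [
        {"text": "We", "start": 0, "end": 2, "id": 0},
        {"text": "are", "start": 3, "end": 6, "id": 1},
        {"text": "visiting", "start": 7, "end": 15, "id": 2},
        {"text": "London", "start": 16, "end": 22, "id": 3},
        {"text": "and", "start": 23, "end": 26, "id": 4},
        {"text": "Berlin", "start": 27, "end": 33, "id": 5},
        {"text": "tomorrow", "start": 34, "end": 42, "id": 6},
    ],
    "spans": [
        {"start": 16, "end": 22, "label": "CITY"},
        {"start": 27, "end": 33, "label": "CITY"},
    ],
}


def spans_align_with_tokens(task):
    """Return True if every span starts and ends exactly on a token boundary."""
    starts = {token["start"] for token in task["tokens"]}
    ends = {token["end"] for token in task["tokens"]}
    return all(s["start"] in starts and s["end"] in ends for s in task["spans"])


def to_jsonl_line(task):
    """Serialise one record; json.dumps without `indent` emits no line breaks."""
    return json.dumps(task, ensure_ascii=False)


line = to_jsonl_line(example)
```

Writing `line + "\n"` to a file for each record gives you the JSONL file.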

Then, if you run

 prodigy ner.manual output_db blank:en input_annotated_texts.jsonl -l CITY

You'd get:
[image: screenshot of the resulting annotation interface]

And then you can either hit "accept" if it's all good, or correct the annotations/labels first.

To get to the required JSONL format from your IOB annotations, you can use spaCy for the conversion, e.g.:


    from spacy.lang.en import English
    from spacy.tokens import Doc

    # Build a Doc from pre-tokenized words; `ents` takes one IOB tag per token
    vocab = English().vocab
    doc = Doc(vocab, words=["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"], spaces=[True, True, True, True, True, True, False], ents=["O", "O", "O", "B-CITY", "O", "B-CITY", "O"])
    for ent in doc.ents:
        print(ent.start_char, ent.end_char, ent.label_)

This will give you:

16 22 CITY
27 33 CITY

Or have a look at some of spaCy's built-in utility functions, e.g. https://spacy.io/api/top-level#biluo_tags_to_spans
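If you'd rather not depend on spaCy for this step, the whole conversion from your two files can also be sketched in plain Python. This is only a rough sketch under a few assumptions: the sentences are already tokenized by whitespace, each line of the label file has exactly one IOB tag per token, and the two files line up. The function names (`iob_to_task`, `convert_files`) are hypothetical. Note that the text is rebuilt with single spaces between tokens, so the character offsets refer to that normalised text.

```python
import json


def iob_to_task(sentence, tags):
    """Build a dict with text, tokens and spans from one sentence and its IOB tags."""
    words = sentence.split()
    labels = tags.split()
    if len(words) != len(labels):
        raise ValueError(f"{len(words)} tokens but {len(labels)} tags: {sentence!r}")
    tokens, spans = [], []
    current = None  # the entity span currently being extended, if any
    offset = 0
    for i, (word, tag) in enumerate(zip(words, labels)):
        start, end = offset, offset + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        offset = end + 1  # assumes a single space between tokens
        if tag.startswith("I-") and current and current["label"] == tag[2:]:
            current["end"] = end  # continue the open entity
        elif tag == "O":
            current = None
        else:
            # B-X, or an I-X with no matching open entity: start a new span
            current = {"start": start, "end": end, "label": tag[2:]}
            spans.append(current)
    return {"text": " ".join(words), "tokens": tokens, "spans": spans}


def convert_files(text_path, tags_path, out_path):
    """Read the two parallel files and write one JSON record per line."""
    with open(text_path, encoding="utf8") as texts, \
         open(tags_path, encoding="utf8") as tags, \
         open(out_path, "w", encoding="utf8") as out:
        for sentence, tag_line in zip(texts, tags):
            if not sentence.strip():
                continue  # skip empty lines
            task = iob_to_task(sentence, tag_line)
            out.write(json.dumps(task, ensure_ascii=False) + "\n")
```

For example, `iob_to_task("We are visiting London and Berlin tomorrow", "O O O B-CITY O B-CITY O")` produces the same offsets as the spaCy snippet above, and the file written by `convert_files` can be passed to `ner.manual` as the input source.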

Hope that helps you get started in the right direction! :)

Thanks a lot for the detailed answer! :)