Re-labeling custom dataset with Prodigy

AnastKuz · June 26, 2021, 10:43am

Hello!
I was wondering if I can load my own dataset (text and labels, currently in the IOB scheme) into Prodigy, and whether the labeled entities will be highlighted.
The text was labeled with predictions from a PyTorch model, and I want to load it into Prodigy to see the labels already highlighted, check that everything is labeled correctly, and correct it where it isn't.
Is this possible? At the moment I have two files: one with the text (one sentence per line), and one with only the labels, one line per sentence, e.g. (O O O B-MTRL I-MTRL O O). Do I need to convert this into dictionaries of text, tokens, and spans before loading it with db-in?

SofieVL · June 27, 2021, 3:33pm

Hi!

I think the best option for you would be to further preprocess the already labeled texts that you have, convert them into Prodigy format and then use ner.manual to correct them.

The target output format you want to obtain is something like the following. I added newlines for readability here, but in the actual JSONL file each text example should be on a single line, without any line breaks:

{"text":"We are visiting London and Berlin tomorrow",
"tokens":[{"text":"We","start":0,"end":2,"id":0},
{"text":"are","start":3,"end":6,"id":1},
{"text":"visiting","start":7,"end":15,"id":2},
{"text":"London","start":16,"end":22,"id":3},
{"text":"and","start":23,"end":26,"id":4},
{"text":"Berlin","start":27,"end":33,"id":5},
{"text":"tomorrow","start":34,"end":42,"id":6}],
"spans":[{"start":16,"end":22,"label":"CITY"},
{"start":27,"end":33,"label":"CITY"}]}

Then if you run

 prodigy ner.manual output_db blank:en input_annotated_texts.jsonl -l CITY

You'd get:
[screenshot of the Prodigy annotation interface]

And then you can either hit "accept" if it's all good, or correct the annotations/labels first.

To get to the required JSONL format from your IOB annotations, you can use spaCy for the conversion, e.g.:


    from spacy.lang.en import English
    from spacy.tokens import Doc

    # Build a Doc from pre-tokenized words; ents takes one IOB tag per token
    vocab = English().vocab
    doc = Doc(vocab, words=["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"], spaces=[True, True, True, True, True, True, False], ents=["O", "O", "O", "B-CITY", "O", "B-CITY", "O"])
    for ent in doc.ents:
        print(ent.start_char, ent.end_char, ent.label_)

This will give you

16 22 CITY
27 33 CITY

Or have a look at some of spaCy's built-in utility tools, e.g. https://spacy.io/api/top-level#biluo_tags_to_spans
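
For the two-file setup described in the question (one sentence per line, one line of IOB tags per sentence), the conversion can also be sketched in plain Python without spaCy. This is only a rough sketch under a few assumptions: the tokens in each sentence are separated by whitespace, each line has exactly as many tags as tokens, and the text can be rebuilt by joining the tokens with single spaces. The function names `iob_to_prodigy` and `convert_files` and any file paths passed to them are made up for illustration and are not part of Prodigy or spaCy.

```python
import json


def iob_to_prodigy(text_line, label_line):
    """Turn one whitespace-tokenized sentence and its IOB tags into a task dict."""
    words = text_line.split()
    tags = label_line.split()
    if len(words) != len(tags):
        raise ValueError(f"{len(words)} tokens but {len(tags)} tags")

    # Character offsets, assuming exactly one space between tokens
    tokens = []
    offset = 0
    for i, word in enumerate(words):
        tokens.append({"text": word, "start": offset, "end": offset + len(word), "id": i})
        offset += len(word) + 1

    spans = []
    current = None  # the entity span currently being built, if any
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current["label"] == label:
            current["end"] = token["end"]  # extend the open span
            continue
        if current is not None:  # any other tag closes the open span
            spans.append(current)
            current = None
        if prefix in ("B", "I"):  # a stray I- tag is treated like B-
            current = {"start": token["start"], "end": token["end"], "label": label}
    if current is not None:
        spans.append(current)

    return {"text": " ".join(words), "tokens": tokens, "spans": spans}


def convert_files(text_path, label_path, out_path):
    """Write one JSON object per line (JSONL) for each sentence/tag line pair."""
    with open(text_path, encoding="utf8") as texts, \
            open(label_path, encoding="utf8") as labels, \
            open(out_path, "w", encoding="utf8") as out:
        for text_line, label_line in zip(texts, labels):
            out.write(json.dumps(iob_to_prodigy(text_line, label_line)) + "\n")


task = iob_to_prodigy("We are visiting London and Berlin tomorrow",
                      "O O O B-CITY O B-CITY O")
print(json.dumps(task["spans"]))
```

Because the text is rebuilt from the tokens, the offsets stay consistent with the "text" field even if the original line had irregular spacing. If your model's tokenization is not whitespace-based, the spaCy `Doc` approach above is probably the safer route.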

Hope that helps you get started in the right direction!

AnastKuz · June 28, 2021, 7:49am

Thanks a lot for the detailed answer!
