Re-labeling custom dataset with Prodigy

SofieVL · June 27, 2021, 3:33pm

Hi!

I think the best option for you would be to further preprocess the already labeled texts that you have, convert them into Prodigy format and then use ner.manual to correct them.

The target output format looks something like the following. I added newlines for readability here, but in the actual JSONL file each text example should sit on a single line, without any line breaks inside it:

{"text":"We are visiting London and Berlin tomorrow", 

"tokens":[{"text":"We","start":0,"end":2,"id":0},
{"text":"are","start":3,"end":6,"id":1},
{"text":"visiting","start":7,"end":15,"id":2},
{"text":"London","start":16,"end":22,"id":3},
{"text":"and","start":23,"end":26,"id":4},
{"text":"Berlin","start":27,"end":33,"id":5},
{"text":"tomorrow","start":34,"end":42,"id":6}], 

"spans": [{"start":16,"end":22,"label":"CITY"},
{"start":27,"end":33,"label":"CITY"}]}
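If you want to sanity-check an example before loading it, something along these lines might help. It's only a rough sketch using the standard library: `find_problems` is a name I made up for illustration (it's not part of Prodigy), and it assumes that spans should start and end on token boundaries, which as far as I know is what the manual NER interface expects.

```python
import json

task = {
    "text": "We are visiting London and Berlin tomorrow",
    "tokens": [
        {"text": "We", "start": 0, "end": 2, "id": 0},
        {"text": "are", "start": 3, "end": 6, "id": 1},
        {"text": "visiting", "start": 7, "end": 15, "id": 2},
        {"text": "London", "start": 16, "end": 22, "id": 3},
        {"text": "and", "start": 23, "end": 26, "id": 4},
        {"text": "Berlin", "start": 27, "end": 33, "id": 5},
        {"text": "tomorrow", "start": 34, "end": 42, "id": 6},
    ],
    "spans": [
        {"start": 16, "end": 22, "label": "CITY"},
        {"start": 27, "end": 33, "label": "CITY"},
    ],
}


def find_problems(task):
    """Hypothetical helper: list problems in one task dict (empty if none)."""
    problems = []
    # Every token's character offsets should slice its own text out of "text".
    for tok in task["tokens"]:
        if task["text"][tok["start"]:tok["end"]] != tok["text"]:
            problems.append(f"token {tok['id']}: offsets do not match the text")
    # Assumption: every span starts and ends exactly on a token boundary.
    starts = {tok["start"] for tok in task["tokens"]}
    ends = {tok["end"] for tok in task["tokens"]}
    for span in task["spans"]:
        if span["start"] not in starts or span["end"] not in ends:
            problems.append(f"span {span['start']}-{span['end']}: not on token boundaries")
    return problems


print(find_problems(task))
# json.dumps emits no newlines by default, so this is one valid JSONL line.
line = json.dumps(task)
```

Writing `line` plus a trailing newline to the file for each example gives you the one-example-per-line layout described above.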

Then, if you run

 prodigy ner.manual output_db blank:en input_annotated_texts.jsonl -l CITY

You'd get:
[image]

And then you can either hit "accept" if it's all good, or correct the annotations/labels first.

To get to the required JSONL format from your IOB annotations, you can use spaCy for the conversion, e.g.:


    from spacy.lang.en import English
    from spacy.tokens import Doc

    vocab = English().vocab
    # ents takes one IOB tag per token (spaCy v3)
    doc = Doc(vocab, words=["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"], spaces=[True, True, True, True, True, True, False], ents=["O", "O", "O", "B-CITY", "O", "B-CITY", "O"])
    for ent in doc.ents:
        print(ent.start_char, ent.end_char, ent.label_)

This will give you

16 22 CITY
27 33 CITY
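To go all the way from IOB tags to a full record with "text", "tokens" and "spans", a dependency-free sketch could look like the one below. It swaps spaCy for plain Python string arithmetic, so treat it as an illustration of the idea rather than the recommended route. `iob_to_task` is a hypothetical helper name, and it assumes you already have the words, a flag per word saying whether a space follows it, and one IOB tag per word. A stray "I-" tag with no matching open entity is treated as the start of a new entity, which is my own assumption.

```python
import json


def iob_to_task(words, spaces, tags):
    """Hypothetical helper: build one task dict from words, space flags and IOB tags."""
    text, tokens, spans = "", [], []
    current = None  # the span currently being extended, if any
    for i, (word, space, tag) in enumerate(zip(words, spaces, tags)):
        start, end = len(text), len(text) + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        text += word + (" " if space else "")
        if tag.startswith("I-") and current is not None and current["label"] == tag[2:]:
            current["end"] = end  # continue the open entity
        elif tag == "O":
            current = None  # outside any entity
        else:
            # "B-X" opens a new entity; a stray "I-X" is handled the same way (assumption).
            current = {"start": start, "end": end, "label": tag[2:]}
            spans.append(current)
    return {"text": text, "tokens": tokens, "spans": spans}


task = iob_to_task(
    ["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"],
    [True, True, True, True, True, True, False],
    ["O", "O", "O", "B-CITY", "O", "B-CITY", "O"],
)
print(json.dumps(task))  # one line per example, as JSONL expects
```

If your data comes with its own tokenization, keep it as-is when building `words` and `spaces`, so the token offsets and the entity spans stay aligned.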

Or have a look at some of spaCy's built-in utility functions, e.g. https://spacy.io/api/top-level#biluo_tags_to_spans

Hope that helps you get started in the right direction!
