Re-labeling custom dataset with Prodigy

Hi!

I think the best option for you would be to further preprocess the already labeled texts that you have, convert them into Prodigy format and then use ner.manual to correct them.

The target output format looks something like the following. I added newlines for readability here, but in the actual JSONL file each text example should sit on a single line, without any line breaks:

{"text":"We are visiting London and Berlin tomorrow", 

"tokens":[{"text":"We","start":0,"end":2,"id":0},
{"text":"are","start":3,"end":6,"id":1},
{"text":"visiting","start":7,"end":15,"id":2},
{"text":"London","start":16,"end":22,"id":3},
{"text":"and","start":23,"end":26,"id":4},
{"text":"Berlin","start":27,"end":33,"id":5},
{"text":"tomorrow","start":34,"end":42,"id":6}], 

"spans": [{"start":16,"end":22,"label":"CITY"},
{"start":27,"end":33,"label":"CITY"}]}
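As a quick sanity check before loading anything into Prodigy, here is a minimal sketch using only Python's standard library. It verifies that the character offsets in "spans" slice out the intended entity text and serializes the example onto a single line. The "tokens" list is left out purely to keep the sketch short, and the variable names are just for illustration.

```python
import json

example = {
    "text": "We are visiting London and Berlin tomorrow",
    "spans": [
        {"start": 16, "end": 22, "label": "CITY"},
        {"start": 27, "end": 33, "label": "CITY"},
    ],
}

# Each span's character offsets should slice out exactly the entity text.
covered = [example["text"][s["start"]:s["end"]] for s in example["spans"]]
print(covered)

# json.dumps does not insert newlines by default, so the whole example
# ends up on one line, which is what a JSONL file expects.
line = json.dumps(example)
print(line)
```

Writing one such line per example to a file gives you the JSONL input for the command below.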

Then if you run

 prodigy ner.manual output_db blank:en input_annotated_texts.jsonl -l CITY

You'd get:
[image: screenshot of the Prodigy annotation interface]

And then you can either hit "accept" if it's all good, or correct the annotations/labels first.

To get to the required JSONL format from your IOB annotations, you can use spaCy for the conversion, e.g.:


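    from spacy.lang.en import English
    from spacy.tokens import Doc

    vocab = English().vocab
    # ents takes one IOB tag per word; spaces says whether a space follows each word
    doc = Doc(vocab, words=["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"], spaces=[True, True, True, True, True, True, False], ents=["O", "O", "O", "B-CITY", "O", "B-CITY", "O"])
    for ent in doc.ents:
        print(ent.start_char, ent.end_char, ent.label_)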

This will give you

16 22 CITY
27 33 CITY
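Building on the snippet above, here is a rough sketch of how you could produce the full Prodigy-style dict (text, tokens and spans) for one example. It assumes spaCy v3, where `Doc` accepts the `ents` argument, and it assumes you know for each word whether a space follows it. The helper name `iob_to_prodigy` is made up for this example and is not part of spaCy or Prodigy.

```python
import json

from spacy.lang.en import English
from spacy.tokens import Doc


def iob_to_prodigy(vocab, words, spaces, iob_tags):
    """Hypothetical helper: build a Prodigy-style dict from IOB-tagged words."""
    doc = Doc(vocab, words=words, spaces=spaces, ents=iob_tags)
    # token.idx is the character offset of the token within doc.text
    tokens = [
        {"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
        for t in doc
    ]
    spans = [
        {"start": e.start_char, "end": e.end_char, "label": e.label_}
        for e in doc.ents
    ]
    return {"text": doc.text, "tokens": tokens, "spans": spans}


vocab = English().vocab
words = ["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"]
spaces = [True, True, True, True, True, True, False]
tags = ["O", "O", "O", "B-CITY", "O", "B-CITY", "O"]

task = iob_to_prodigy(vocab, words, spaces, tags)
# One JSON object per line is what the JSONL file should contain.
print(json.dumps(task))
```

If your original data doesn't record the whitespace, you'd have to reconstruct `spaces` yourself (for instance by assuming a single space between all words), and the resulting offsets will then refer to that reconstructed text.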

Or have a look at some of spaCy's built-in utility functions, e.g. https://spacy.io/api/top-level#biluo_tags_to_spans
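For reference, a small sketch of that utility route, assuming spaCy v3, where `iob_to_biluo` and `biluo_tags_to_spans` can be imported from `spacy.training`. Since the linked function expects BILUO tags, the IOB tags are converted first.

```python
from spacy.lang.en import English
from spacy.tokens import Doc
from spacy.training import biluo_tags_to_spans, iob_to_biluo

words = ["We", "are", "visiting", "London", "and", "Berlin", "tomorrow"]
spaces = [True, True, True, True, True, True, False]
iob_tags = ["O", "O", "O", "B-CITY", "O", "B-CITY", "O"]

doc = Doc(English().vocab, words=words, spaces=spaces)

# Single-token entities become U- (unit) tags in the BILUO scheme.
biluo_tags = iob_to_biluo(iob_tags)

# Returns a list of Span objects with character offsets and labels.
spans = biluo_tags_to_spans(doc, biluo_tags)
for span in spans:
    print(span.start_char, span.end_char, span.label_)
```

This should give the same character offsets as the `Doc(..., ents=...)` approach, so which one you pick is mostly a matter of taste.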

Hope that helps you get started in the right direction! :slight_smile:
