BIO (E/S) encodings for Prodigy annotations in sequence labeling applications

Does Prodigy have built-in support for BIO (etc.) encodings?

We don’t have a converter for that, no. We thought about adding one, but there are a few subtle variations (IOB vs IOB2 vs BILOU), and the resulting problems could be hard to debug if the converter loads the data incorrectly. We also usually encourage stand-off formats, as they avoid a potential source of train/test skew.
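
Just to illustrate why those variations are easy to mix up, here’s the same three-token LOC entity ("New York City") under each scheme (the tag definitions are standard; the snippet is only for illustration):

iob1 = ['I-LOC', 'I-LOC', 'I-LOC']    # IOB(1): B- only marks the boundary between adjacent entities of the same type
iob2 = ['B-LOC', 'I-LOC', 'I-LOC']    # IOB2: every entity starts with B- (what most people mean by "BIO")
bilou = ['B-LOC', 'I-LOC', 'L-LOC']   # BILOU: explicit L- for the last token; single-token entities get U-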

I would suggest creating a custom recipe, so that you can control how spaCy tokenizes the text. You’ll want to do something like this:


from spacy.language import Language
from spacy.tokens import Doc

def create_whitespace_tokenizer(nlp):
    # Build the Doc directly from whitespace-split words, bypassing
    # spaCy's rule-based tokenizer entirely.
    def whitespace_tokenizer(string):
        return Doc(nlp.vocab, words=string.split())
    return whitespace_tokenizer

Language.factories['tokenizer'] = create_whitespace_tokenizer
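
To give a rough idea of how that plugs in, you’d create the nlp object as usual and then overwrite its tokenizer with the factory above before feeding in your pre-tokenized text. This is just a sketch; the model name is only an example:

import spacy

nlp = spacy.load('en_core_web_sm')  # example model; use whatever pipeline you're training
nlp.tokenizer = create_whitespace_tokenizer(nlp)  # factory defined above
doc = nlp('U.N. official Ekeus heads for Baghdad')
print([token.text for token in doc])  # tokens are exactly the whitespace-split words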

spaCy’s training process is a little unusual among NLP tools, in that we actually use the run-time tokenization during training. Most tools use the gold-standard tokenization and sentence segmentation. However, I think this is mostly because the tools are only ever evaluated with gold-standard segmentation – so this source of train/test skew stays hidden.

It’s usually very helpful to have the model learn from exactly the tokenization you’re going to get at runtime. But if the text comes in an IOB format, we don’t get to tokenize the text during training, so we have to hack the tokenizer, as above.

If you’re not pre-tokenizing the text at runtime, you should probably make sure spaCy’s tokenizer matches your NER data’s tokenization reasonably well. You can customize it in a variety of ways if necessary: https://spacy.io/usage/linguistic-features#section-tokenization
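
For example, you can add an infix pattern so the tokenizer splits where your data does. A sketch, assuming intra-word hyphens are the mismatch:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')  # example model
# Assumption: the NER data splits tokens on intra-word hyphens.
infixes = list(nlp.Defaults.infixes) + [r'(?<=[A-Za-z])-(?=[A-Za-z])']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer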

Yeah, we’ve written our own tokenizer for all labeling, training, and prediction. I definitely agree that the tokenization has to be the same, otherwise it’s nearly impossible to evaluate the performance of a sequence labeler.

The tokenization-overriding step was not particularly simple to get working with Prodigy, though we figured it all out with some help from you guys 🙂

I assume I’ll write a couple different “encoders” that take multi-token spans and convert them - something like labels_as_bio(doc) or labels_as_bioe(doc) or whatever. I may be misunderstanding some of your response though… are you saying there’s another way (outside of some encoding system) to train a sequence labeler on multi-token label spans?
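
Something like this is what I have in mind, just a sketch built on spaCy’s per-token IOB attributes (labels_as_bio is my own name for it):

def labels_as_bio(doc):
    # Sketch: one tag per token, derived from the token.ent_iob_ /
    # token.ent_type_ attributes that spaCy sets from doc.ents.
    tags = []
    for token in doc:
        if token.ent_iob_ == 'O':
            tags.append('O')
        else:
            tags.append('{}-{}'.format(token.ent_iob_, token.ent_type_))
    return tags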

This reminds me: do you intend to allow intra-token annotation at some point, or is the only way to do that through a tokenizer that splits on every char?