BIO (E/S) encodings for Prodigy annotations in sequence labeling applications

Does Prodigy have built-in support for BIO (etc.) encodings?

We don’t have a converter for that, no. We thought about adding one, but there are a few subtle variations (IOB vs IOB2 vs BILOU), and the resulting problems could be hard to debug if the converter loads the data incorrectly. We also usually encourage stand-off formats, as they avoid a potential source of train/test skew.
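
Just to illustrate why those variations are easy to mix up, here’s the same three-token LOC entity ("New York City") under each scheme (the tag definitions are standard; the snippet is only for illustration):

iob1 = ['I-LOC', 'I-LOC', 'I-LOC']    # IOB(1): B- only marks the boundary between adjacent entities of the same type
iob2 = ['B-LOC', 'I-LOC', 'I-LOC']    # IOB2: every entity starts with B- (what most people mean by "BIO")
bilou = ['B-LOC', 'I-LOC', 'L-LOC']   # BILOU: explicit L- for the last token; single-token entities get U-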

I would suggest creating a custom recipe, so that you can control how spaCy tokenizes the text. You’ll want to do something like this:


from spacy.language import Language
from spacy.tokens import Doc

def create_whitespace_tokenizer(nlp):
    # Build the Doc directly from whitespace-split words, bypassing
    # spaCy's rule-based tokenizer entirely.
    def whitespace_tokenizer(string):
        return Doc(nlp.vocab, words=string.split())
    return whitespace_tokenizer

Language.factories['tokenizer'] = create_whitespace_tokenizer
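
To give a rough idea of how that plugs in, you’d create the nlp object as usual and then overwrite its tokenizer with the factory above before feeding in your pre-tokenized text. This is just a sketch; the model name is only an example:

import spacy

nlp = spacy.load('en_core_web_sm')  # example model; use whatever pipeline you're training
nlp.tokenizer = create_whitespace_tokenizer(nlp)  # factory defined above
doc = nlp('U.N. official Ekeus heads for Baghdad')
print([token.text for token in doc])  # tokens are exactly the whitespace-split words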

spaCy’s training process is a little unusual among NLP tools, in that we actually use the run-time tokenization during training. Most tools use the gold-standard tokenization and sentence segmentation. However, I think this is mostly because the tools are only ever evaluated with gold-standard segmentation – so this source of train/test skew stays hidden.

It’s usually very helpful to have the model learn from exactly the tokenization you’re going to get at runtime. But if the text comes in an IOB format, we don’t get to tokenize the text during training, so we have to hack the tokenizer, as above.

If you’re not pre-tokenizing the text at runtime, you should probably make sure spaCy’s tokenizer matches your NER data’s tokenization reasonably well. You can customize it in a variety of ways if necessary: https://spacy.io/usage/linguistic-features#section-tokenization
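
For example, you can add an infix pattern so the tokenizer splits where your data does. A sketch, assuming intra-word hyphens are the mismatch:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')  # example model
# Assumption: the NER data splits tokens on intra-word hyphens.
infixes = list(nlp.Defaults.infixes) + [r'(?<=[A-Za-z])-(?=[A-Za-z])']
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer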

Yeah, we’ve written our own tokenizer for all labeling, training, and prediction. I definitely agree that the tokenization has to be the same, otherwise it’s nearly impossible to evaluate the performance of a sequence labeler.

The tokenization-overriding step was not particularly simple to get working with Prodigy, though we figured it all out with some help from you guys 🙂

I assume I’ll write a couple different “encoders” that take multi-token spans and convert them - something like labels_as_bio(doc) or labels_as_bioe(doc) or whatever. I may be misunderstanding some of your response though… are you saying there’s another way (outside of some encoding system) to train a sequence labeler on multi-token label spans?
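
Something like this is what I have in mind, just a sketch built on spaCy’s per-token IOB attributes (labels_as_bio is my own name for it):

def labels_as_bio(doc):
    # Sketch: one tag per token, derived from the token.ent_iob_ /
    # token.ent_type_ attributes that spaCy sets from doc.ents.
    tags = []
    for token in doc:
        if token.ent_iob_ == 'O':
            tags.append('O')
        else:
            tags.append('{}-{}'.format(token.ent_iob_, token.ent_type_))
    return tags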

This reminds me: do you intend to allow intra-token annotation at some point, or is the only way to do that through a tokenizer that splits on every char?