BERT recipe when using transformer in pipeline?

Hi everyone,

My goal is to train a model for NER and I have a question regarding tokenization.
In my pipeline I want to use BERT but I'm not sure if that means that I have to use BERT's tokenizer during annotation.
Is the bert.ner.manual recipe from the docs only meant for when I want to feed the data to BERT directly, or also for when I'm using BERT as part of a spacy NER model?

There is this image in the spacy docs:

It makes it seem like a separate tokenizer is used whether I'm using a transformer or not, so I'm not sure whether I should be annotating the data as if it were fed directly into BERT.

I'm not sure I'm making myself clear, but I hope someone can help me out.
Any hints would be much appreciated.

No, you don't have to use the BERT tokenizer for NER annotation. In the pipeline diagram above, the transformer component handles the alignment between spacy tokens and BERT wordpiece tokens under the hood, so you can work only with spacy tokens if you'd like. The tokenizer in that diagram is the spacy tokenizer, which is configurable and could theoretically be a wordpiece tokenizer, but typically it's the default rule-based tokenizer in spacy (currently spacy.Tokenizer.v1).
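If it helps to see this concretely, here's a minimal sketch (assuming you have the en_core_web_trf pipeline installed; the exact pipe names and extension attributes can vary by version) showing that the pipeline's tokenizer is still spacy's rule-based tokenizer and that the transformer component keeps the wordpiece alignment to itself:

```python
import spacy

# Assumes en_core_web_trf is installed: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

print(type(nlp.tokenizer))  # spacy's rule-based Tokenizer, not a wordpiece tokenizer
print(nlp.pipe_names)       # e.g. ['transformer', 'tagger', 'parser', ..., 'ner']

doc = nlp("Smith works at the United Nations in Geneva.")
print([t.text for t in doc])  # these spacy tokens are what you annotate against

# The wordpiece tokens and their alignment back to these spacy tokens are
# stored on the Doc by the transformer component (e.g. doc._.trf_data), so
# your annotations never need to reference wordpieces directly.
```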

Just on the annotation side of things, most spacy token boundaries correspond to wordpiece token boundaries, so the difference between annotating with BERT wordpiece tokens and spacy tokens is very minor.
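To get a feel for how small the difference usually is, here's a rough comparison sketch (assuming the transformers library is installed; bert-base-uncased is just an example model):

```python
import spacy
from transformers import AutoTokenizer

text = "Tokenization differences are usually minor."

spacy_tokens = [t.text for t in spacy.blank("en")(text)]
wordpieces = AutoTokenizer.from_pretrained("bert-base-uncased").tokenize(text)

print(spacy_tokens)  # whole words and punctuation
print(wordpieces)    # lowercased wordpieces, rare words split with '##' continuations

# Wordpiece splits almost always fall *inside* a spacy token, so entity spans
# defined over whole spacy tokens still map cleanly onto wordpiece boundaries.
```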

If your goal is to train a spacy NER component, then it makes sense to annotate using the spacy tokenization because that corresponds best to how the model will be trained and evaluated. We train all the provided trf pipelines from data aligned to word-level tokens and it works fine. There are a small number of misalignments between the training data and the spacy tokens, and between the spacy tokens and the wordpiece tokens, but nearly all entity spans align without issues and the NER component is designed to skip the few cases that are misaligned.
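If you want to spot-check your own data for misalignments, one way (just a sketch using spacy.training.offsets_to_biluo_tags; the text and offsets below are made up) is to convert character-offset annotations to BILUO tags and look for '-' entries:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp("Send the report to Anna-Maria by Friday.")

# Character-offset entity annotations, e.g. exported from your annotation tool.
entities = [(19, 29, "PERSON"), (33, 39, "DATE")]

tags = offsets_to_biluo_tags(doc, entities)
print(list(zip([t.text for t in doc], tags)))

# Any token whose boundaries don't line up with the offsets gets a '-' tag and
# triggers a warning; the NER component skips those tokens during training, so
# a handful of misalignments won't derail the rest of your data.
```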


Thank you, that clears things up nicely 🙂