data-to-spacy is not using my custom tokenizer

I have created a custom tokenizer (BertTokenizer) that I would like to use in my transformer training pipeline and for annotating text with Prodigy, but when I use the data-to-spacy recipe I get the following error:

ValueError: [E949] Unable to align tokens for the predicted and reference docs. It is only possible 
to align the docs when both texts are the same except for whitespace and capitalization. The 
predicted tokens start with: ['[CLS]', 'a', 'retrospective', 'study', 'of', 'mcr', '##pc', 
'patients', 'harboring', 'ar']. The reference tokens start with: ['A', 'retrospective', 'study', 
'of', 'mCRPC', 'patients', 'harboring', 'AR', 'copy', 'number'].

It seems that my custom tokenizer is NOT used to create the reference docs, but it IS used to make the predicted docs.
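
To show what the custom tokenizer produces on its own, here is a rough sketch (untested; it assumes the module that registers blue_heron.BertTokenizer.v1 is importable as blue_heron_ai.blue_heron_ai):

import spacy
from spacy.util import registry
import blue_heron_ai.blue_heron_ai  # noqa: F401 -- importing registers blue_heron.BertTokenizer.v1

# Build the tokenizer the same way spaCy would from the config
nlp = spacy.blank("en")
factory = registry.tokenizers.get("blue_heron.BertTokenizer.v1")
nlp.tokenizer = factory(model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")(nlp)

doc = nlp("A retrospective study of mCRPC patients harboring AR copy number")
print([t.text for t in doc])  # for me this starts with ['[CLS]', 'a', 'retrospective', ...]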

This is the top of my pipeline config:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser","ner","textcat","spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[nlp.tokenizer]
@tokenizers = "blue_heron.BertTokenizer.v1"
model = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

[components]

And this is the definition of my tokenizer and registration:

"""Main module."""

from typing import List, Optional, Union, Iterable, Dict, Any, Callable

from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer
from spacy.tokens import Doc
import spacy
import prodigy
from spacy import Language
from spacy.util import registry
from prodigy.types import StreamType
from prodigy.components.loaders import get_stream, JSONL
from prodigy.util import (
    get_labels,
    load_model,
    log
)
class BertTokenizer:
    def __init__(
            self,
            nlp_vocab,
            vocab,
            lowercase=True):
        """Use the huggingface transformer tokenizer BertWordPieceTokenizer"""
        self.vocab = nlp_vocab
        self._tokenizer = BertWordPieceTokenizer(vocab, lowercase=lowercase)

    def __call__(self, text):
        tokens = self._tokenizer.encode(text)
        words = []
        spaces = []
        for i, (text, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
            words.append(text)
            if i < len(tokens.tokens) - 1:
                # If next start != current end we assume a space in between
                next_start, next_end = tokens.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                spaces.append(True)
        return Doc(self.vocab, words=words, spaces=spaces)

    @classmethod
    def from_pretrained(cls, vocab, model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", lowercase=True):
        """Create a BertTokenizer using the vocabulary of a pretrained (huggingface) model"""
        tok = AutoTokenizer.from_pretrained(model, lowercase=lowercase)
        return cls(vocab, tok.vocab, lowercase=lowercase)


@registry.tokenizers("blue_heron.BertTokenizer.v1")
def create_tokenizer(model: str) -> Callable[["Language"], BertTokenizer]:
    """Registered function to create a tokenizer. Returns a factory that takes
    the nlp object and returns a BertTokenizer instance using the language defaults.
    """

    def tokenizer_factory(nlp: "Language") -> BertTokenizer:
        return BertTokenizer.from_pretrained(
            nlp.vocab,
            model=model
        )

    return tokenizer_factory

And here is some test data

{"text":"resistance to enzalutamide was reported in a patient harboring F876L mutation."}
{"text":"A retrospective study on 29 mCRPC patients progressing on abiraterone treatment  reported abiraterone-resistance in 7 patients harboring AR (H874Y and T877A)  mutations."}
{"text":"A retrospective study on 19 mCRPC patients progressing on enzalutamide treatment  reported enzalutamide-resistance in a patient harboring AR (H874Y)  mutations."}

This was the command I used to create the spacy data:

prodigy data-to-spacy \
        --ner drug_rules_entities --eval-split 0.5 --config ./spacy_drug_rules.cfg \
        -F ${HOME}/Dropbox/CODE/blue_heron_ai/blue_heron_ai/blue_heron_ai.py drug_rules

Where the blue_heron_ai.py file contains the BertTokenizer definition.

Hi there!

I'm working on reproducing your example locally, but since you've only given me a partial config, I tried to fill it in via spaCy. Note that I moved your Python code into a file called tok.py and that I'm referring to it below.

python -m spacy init fill-config config.cfg --code tok.py

This led to an error because it cannot find a transformer in the [components] block. Could you share the full config file, just so that I can make sure that I have the same settings as you?

Thanks for looking into this...
Here's the full config

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser","ner","textcat","spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[nlp.tokenizer]
@tokenizers = "blue_heron.BertTokenizer.v1"
model = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

[components]

[components.morphologizer]
factory = "morphologizer"
extend = false
overwrite = true
scorer = {"@scorers":"spacy.morphologizer_scorer.v1"}

[components.morphologizer.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.morphologizer.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers":"spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.tagger]
factory = "tagger"
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.textcat]
factory = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v2"}
threshold = 0.0

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false
nO = null

[components.textcat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.trainable_lemmatizer]
factory = "trainable_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
top_k = 1

[components.trainable_lemmatizer.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.trainable_lemmatizer.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
tag_acc = 0.14
pos_acc = 0.07
morph_acc = 0.07
morph_per_feat = null
lemma_acc = 0.14
dep_uas = 0.07
dep_las = 0.07
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.0
ents_f = 0.14
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
cats_score = 0.14
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
spans_sc_f = 0.14
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Hi Thon.

Just to check in again. How did you annotate the data? Did you use a recipe like this?

prodigy ner.manual <dataset-name> en_core_web_sm examples.jsonl --label bio-label

I'm worried that you annotated using a standard spaCy tokenizer and are trying to export using the custom one.
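
If it helps, here's a rough sketch (untested) of how you could check: pull a few examples from the dataset and compare the tokens that were stored during annotation with what the custom tokenizer produces now. I'm assuming the dataset is called drug_rules_entities and that the BertTokenizer class is importable from your tok.py.

import spacy
from prodigy.components.db import connect
from tok import BertTokenizer  # your custom tokenizer class

nlp = spacy.blank("en")
nlp.tokenizer = BertTokenizer.from_pretrained(
    nlp.vocab,
    model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
)

db = connect()
for eg in db.get_dataset("drug_rules_entities")[:3]:
    saved = [t["text"] for t in eg.get("tokens", [])]
    fresh = [t.text for t in nlp(eg["text"])]
    print("saved:", saved[:10])
    print("fresh:", fresh[:10])

If "saved" and "fresh" disagree, that would point at the annotation and export steps using different tokenizers.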

I used a variant of the ner.manual.bert recipe:

"""Main module."""

from typing import List, Optional, Union, Iterable, Dict, Any, Callable

from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer
from spacy.tokens import Doc
import spacy
import prodigy
from spacy import Language
from spacy.util import registry
from prodigy.types import StreamType
from prodigy.components.loaders import get_stream, JSONL
from prodigy.util import (
    get_labels,
    load_model,
    log
)
class BertTokenizer:
    def __init__(
            self,
            nlp_vocab,
            vocab,
            lowercase=True):
        """Use the huggingface transformer tokenizer BertWordPieceTokenizer"""
        self.vocab = nlp_vocab
        self._tokenizer = BertWordPieceTokenizer(vocab, lowercase=lowercase)

    def __call__(self, text):
        tokens = self._tokenizer.encode(text)
        words = []
        spaces = []
        for i, (text, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
            words.append(text)
            if i < len(tokens.tokens) - 1:
                # If next start != current end we assume a space in between
                next_start, next_end = tokens.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                spaces.append(True)
        return Doc(self.vocab, words=words, spaces=spaces)

    @classmethod
    def from_pretrained(cls, vocab, model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", lowercase=True):
        """Create a BertTokenizer using the vocabulary of a pretrained (huggingface) model"""
        tok = AutoTokenizer.from_pretrained(model, lowercase=lowercase)
        return cls(vocab, tok.vocab, lowercase=lowercase)


@registry.tokenizers("blue_heron.BertTokenizer.v1")
def create_tokenizer(model: str) -> Callable[["Language"], BertTokenizer]:
    """Registered function to create a tokenizer. Returns a factory that takes
    the nlp object and returns a BertTokenizer instance using the language defaults.
    """

    def tokenizer_factory(nlp: "Language") -> BertTokenizer:
        return BertTokenizer.from_pretrained(
            nlp.vocab,
            model=model
        )

    return tokenizer_factory


def add_tokens(
        nlp:Language,
        stream:StreamType,
        skip:bool=False,
        overwrite:bool=False,
        use_chars:bool=False,
        batch_size:int=1,
        hide_special: bool = False,
        hide_wp_prefix: bool = False) -> StreamType:
        """Special version of the prodigy `add_tokens` function that can deal with Bert-style tokenization
        
        Parameters
        ----------
        
        Result
        ------
          StreamType : Stream with the addition of the tokens to the example stream
        """

        if isinstance(nlp.tokenizer, BertTokenizer):
            tokenizer = nlp.tokenizer._tokenizer
            sep_token = tokenizer._parameters.get("sep_token")
            cls_token = tokenizer._parameters.get("cls_token")
            special_tokens = (sep_token, cls_token)
            wp_prefix = tokenizer._parameters.get("wordpieces_prefix")

            for eg in stream:
                tokens = tokenizer.encode(eg["text"])
                eg_tokens = []
                idx = 0
                for (text, (start, end), tid) in zip(
                    tokens.tokens, tokens.offsets, tokens.ids
                ):
                    # If we don't want to see special tokens, don't add them
                    if hide_special and text in special_tokens:
                        continue
                    # If we want to strip out word piece prefix, remove it from text
                    if hide_wp_prefix and wp_prefix is not None:
                        if text.startswith(wp_prefix):
                            text = text[len(wp_prefix) :]
                    token = {
                        "text": text,
                        "id": idx,
                        "start": start,
                        "end": end,
                        # This is the encoded ID returned by the tokenizer
                        "tokenizer_id": tid,
                        # Don't allow selecting special SEP/CLS tokens
                        "disabled": text in special_tokens,
                    }
                    eg_tokens.append(token)
                    idx += 1
                for i, token in enumerate(eg_tokens):
                    # If the next start offset != the current end offset, we
                    # assume there's whitespace in between
                    if i < len(eg_tokens) - 1 and token["text"] not in special_tokens:
                        next_token = eg_tokens[i + 1]
                        token["ws"] = (
                            next_token["start"] > token["end"]
                            or next_token["text"] in special_tokens
                        )
                    else:
                        token["ws"] = True
                eg["tokens"] = eg_tokens
                yield eg
        else:
            # This function is a generator (it yields above), so delegate to
            # Prodigy's built-in add_tokens with `yield from` instead of `return`
            yield from prodigy.components.preprocess.add_tokens(
                nlp,
                stream,
                skip=skip,
                overwrite=overwrite,
                use_chars=use_chars,
                batch_size=batch_size,
            )

#####
#
# Prodigy RECIPES
#

@prodigy.recipe(
    "bert.ner.manual",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("The source data as a JSONL file", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    model=("Huggingface model", "option", "m", str),
    lowercase=("Set lowercase=True for tokenizer", "flag", "LC", bool),
    hide_special=("Hide SEP and CLS tokens visually", "flag", "HS", bool),
    hide_wp_prefix=("Hide wordpieces prefix like ##", "flag", "HW", bool)
    # fmt: on
)
def ner_manual_tokenizers_bert(
    dataset: str,
    source: Union[str, Iterable[dict]],
    loader: Optional[str] = None,
    label: Optional[List[str]] = None,
    model: Optional[str] = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    lowercase: bool = False,
    hide_special: bool = False,
    hide_wp_prefix: bool = False,
) -> Dict[str, Any]:
    """Example recipe that shows how to use model-specific tokenizers like the
    BERT word piece tokenizer to preprocess your incoming text for fast and
    efficient NER annotation and to make sure that all annotations you collect
    always map to tokens and can be used to train and fine-tune your model
    (even if the tokenization isn't that intuitive, because of word pieces). The
    selection automatically snaps to the token boundaries and you can double-click
    single tokens to select them.

    Setting "honor_token_whitespace": true will ensure that whitespace between
    tokens is only shown if whitespace is present in the original text. This
    keeps the text readable.

    Requires Prodigy v1.10+ and uses the HuggingFace tokenizers library."""
    log("RECIPE: Starting recipe ner.manual", locals())
    stream = get_stream(source, loader=loader, input_key="text", rehash=True, dedup=True)

    spacy_model = 'en_core_web_trf'
    nlp = load_model(spacy_model)
    nlp.tokenizer = BertTokenizer.from_pretrained(nlp.vocab, model=model, lowercase=lowercase)
    
    stream = add_tokens(nlp, stream, hide_special=hide_special, hide_wp_prefix=hide_wp_prefix)

    blocks = [
        {"view_id": "spans_manual"}
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {
            "honor_token_whitespace": True,
            "labels": label,
            "blocks": blocks,
            "exclude_by": "input",
            "force_stream_order": True
        },
    }

Then I used this manual recipe to annotate:

prodigy bert.ner.manual drug_rules_entities evidence_text.jsonl \
        --model microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext \
        --label DRUG,GENE,MUTATION,DISEASE \
        --lowercase --hide-special --hide-wp-prefix \
        -F ${HOME}/Dropbox/CODE/blue_heron_ai/blue_heron_ai/blue_heron_ai.py
        

prodigy data-to-spacy \
        --ner drug_rules_entities --eval-split 0.5 --config ./spacy_drug_rules.cfg \
        -F ${HOME}/Dropbox/CODE/blue_heron_ai/blue_heron_ai/blue_heron_ai.py drug_rules

I followed your steps: first I annotated some data via python -m prodigy bert.ner.manual, and then I exported via prodigy data-to-spacy. I get the same error message.

ValueError: [E949] Unable to align tokens for the predicted and reference docs. It is only possible to align the docs when both texts are the same except for whitespace and capitalization. The predicted tokens start with: ['[CLS]', 'a', 'retrospective', 'study', 'on', '29', 'mcr', '##pc', 'patients', 'progressing']. The reference tokens start with: ['A', 'retrospective', 'study', 'on', '29', 'mCRPC', 'patients', 'progressing', 'on', 'abiraterone'].

I figured I'd have a look at the two token sequences that it's complaining about.

['[CLS]', 'a', 'retrospective', 'study', 'on', '29', 'mcr', '##pc', 'patients', 'progressing']

and

['A', 'retrospective', 'study', 'on', '29', 'mCRPC', 'patients', 'progressing', 'on', 'abiraterone'].

These differ on two levels: one has an extra [CLS] token, but there's also a capitalisation difference between a and A. So that suggests that there are indeed two different tokenisers at play here.
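
To make the difference concrete, here's a small sketch (untested) comparing the wordpiece tokenizer from the transformers library with spaCy's default English tokenizer on the sentence from the error. The second output is much closer to the reference tokens above.

from transformers import AutoTokenizer
import spacy

text = "A retrospective study on 29 mCRPC patients progressing on abiraterone treatment"

hf = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")
print(hf.convert_ids_to_tokens(hf.encode(text))[:10])
# roughly: ['[CLS]', 'a', 'retrospective', 'study', 'on', '29', 'mcr', '##pc', 'patients', 'progressing']

nlp = spacy.blank("en")
print([t.text for t in nlp(text)][:10])
# roughly: ['A', 'retrospective', 'study', 'on', '29', 'mCRPC', 'patients', 'progressing', 'on', 'abiraterone']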

Looking deeper into your code, I also spot some differences.

Line 48 (in the BertTokenizer.from_pretrained classmethod) uses this:

AutoTokenizer.from_pretrained(model, lowercase=lowercase)

Line 59 (in the tokenizer_factory inner function) uses this:

BertTokenizer.from_pretrained(nlp.vocab, model=model)

Line 189 (the tokeniser that you attach to the nlp object in the recipe) uses this:

BertTokenizer.from_pretrained(nlp.vocab, model=model, lowercase=lowercase)

So from my perspective, it seems like there are indeed subtle differences that could be causing the mismatch you're seeing. The AutoTokenizer does not use the nlp.vocab, and there's also a lowercase inconsistency. As a next step, it would probably be best to retry this exercise and make sure that all the tokenisers are loaded with the exact same Python function.
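
Something along these lines might work as a pattern (just a sketch of how the existing registration and the recipe could share one helper; adjust the import path to wherever your BertTokenizer lives):

from spacy.util import registry
from blue_heron_ai.blue_heron_ai import BertTokenizer  # omit if this lives in the same module as BertTokenizer

def make_bert_tokenizer(nlp, model, lowercase=True):
    # The single place where the tokenizer is configured
    return BertTokenizer.from_pretrained(nlp.vocab, model=model, lowercase=lowercase)

@registry.tokenizers("blue_heron.BertTokenizer.v1")
def create_tokenizer(model: str, lowercase: bool = True):
    def tokenizer_factory(nlp):
        return make_bert_tokenizer(nlp, model, lowercase)
    return tokenizer_factory

# ...and inside the recipe, instead of calling from_pretrained directly:
# nlp.tokenizer = make_bert_tokenizer(nlp, model, lowercase)

That way the recipe you annotate with and the pipeline that data-to-spacy builds from the config go through the same code path.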

If you need extra help, do let me know, but I figured this would be a good point for you to revisit your codebase. There might be some considerations that I'm skipping over, partly because I'm less familiar with your dataset, but also because I'm not super familiar with training my own BERT models.

Let me know!

Thanks for looking into this... The lowercase is a default setting for BertTokenizer, so that ain't it. It is indeed likely that the problem is that AutoTokenizer does not use the vocab, but since that is the Hugging Face API I'm not sure how to fix it. I'll have a look at it... Anyone have any ideas?

Can't you replace the AutoTokenizer with the BertTokenizer?
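
If the idea is to use the transformers BertTokenizer class instead of AutoTokenizer when building the wordpiece vocab, the swap might look roughly like this inside the from_pretrained classmethod (sketch, not tested; note the slow BertTokenizer takes do_lower_case rather than lowercase):

from transformers import BertTokenizer as HFBertTokenizer

    @classmethod
    def from_pretrained(cls, vocab, model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", lowercase=True):
        """Create a BertTokenizer using the vocabulary of a pretrained (huggingface) model"""
        tok = HFBertTokenizer.from_pretrained(model, do_lower_case=lowercase)
        return cls(vocab, tok.vocab, lowercase=lowercase)

Whether or not that changes the alignment error, the more important part is probably still loading every tokenizer through one shared function, as suggested above.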