I have created a custom tokenizer (BertTokenizer) that I would like to use in my transformer training pipeline, and I want to annotate some text with Prodigy. But when I use the data-to-spacy recipe, I get the following error:
ValueError: [E949] Unable to align tokens for the predicted and reference docs. It is only possible
to align the docs when both texts are the same except for whitespace and capitalization. The
predicted tokens start with: ['[CLS]', 'a', 'retrospective', 'study', 'of', 'mcr', '##pc',
'patients', 'harboring', 'ar']. The reference tokens start with: ['A', 'retrospective', 'study',
'of', 'mCRPC', 'patients', 'harboring', 'AR', 'copy', 'number'].
It seems that my custom tokenizer is NOT used to create the reference docs, but it IS used to make the predicted docs.
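To illustrate: the reference tokens in the error look like spaCy's default English tokenization, while the predicted tokens look like the wordpiece output of my tokenizer. A minimal sketch of that comparison (assuming my BertTokenizer class, shown further down, is importable from my blue_heron_ai module):

import spacy
# Hypothetical import path; the class is defined in the blue_heron_ai.py loaded via -F below
from blue_heron_ai.blue_heron_ai import BertTokenizer

text = "A retrospective study on 29 mCRPC patients progressing on abiraterone treatment reported abiraterone-resistance in 7 patients harboring AR (H874Y and T877A) mutations."

# Default English tokenization -- this is what the "reference" tokens look like
nlp_default = spacy.blank("en")
print([t.text for t in nlp_default.make_doc(text)])
# e.g. ['A', 'retrospective', 'study', 'on', '29', 'mCRPC', ...]

# My wordpiece tokenization -- this is what the "predicted" tokens look like
nlp_custom = spacy.blank("en")
nlp_custom.tokenizer = BertTokenizer.from_pretrained(nlp_custom.vocab)
print([t.text for t in nlp_custom.make_doc(text)])
# e.g. ['[CLS]', 'a', 'retrospective', 'study', 'on', ...]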
This is the top of my pipeline config:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","tagger","morphologizer","trainable_lemmatizer","parser","ner","textcat","spancat"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[nlp.tokenizer]
@tokenizers = "blue_heron.BertTokenizer.v1"
model = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

[components]
And this is the definition and registration of my tokenizer:
"""Main module."""
from typing import List, Optional, Union, Iterable, Dict, Any, Callable
from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer
from spacy.tokens import Doc
import spacy
import prodigy
from spacy import Language
from spacy.util import registry
from prodigy.types import StreamType
from prodigy.components.loaders import get_stream, JSONL
from prodigy.util import (
get_labels,
load_model,
log
)
class BertTokenizer:
def __init__(
self,
nlp_vocab,
vocab,
lowercase=True):
"""Use the huggingface transformer tokenizer BertWordPieceTokenizer"""
self.vocab = nlp_vocab
self._tokenizer = BertWordPieceTokenizer(vocab, lowercase=lowercase)
def __call__(self, text):
tokens = self._tokenizer.encode(text)
words = []
spaces = []
for i, (text, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
words.append(text)
if i < len(tokens.tokens) - 1:
# If next start != current end we assume a space in between
next_start, next_end = tokens.offsets[i + 1]
spaces.append(next_start > end)
else:
spaces.append(True)
return Doc(self.vocab, words=words, spaces=spaces)
@classmethod
def from_pretrained(cls, vocab, model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", lowercase=True):
"""Create a BertTokenizer using the vocabulary of a pretrained (huggingface) model"""
tok = AutoTokenizer.from_pretrained(model, lowercase=lowercase)
return cls(vocab, tok.vocab, lowercase=lowercase)
@registry.tokenizers("blue_heron.BertTokenizer.v1")
def create_tokenizer(model: str) -> Callable[["Language"], BertTokenizer]:
    """Registered function to create a tokenizer. Returns a factory that takes
    the nlp object and returns a Tokenizer instance using the language defaults.
    """
    def tokenizer_factory(nlp: "Language") -> BertTokenizer:
        return BertTokenizer.from_pretrained(
            nlp.vocab,
            model=model
        )
    return tokenizer_factory
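As a sanity check, this is roughly how I expect spaCy to build the tokenizer from the registered factory; a minimal sketch using the same model name as in the config:

import spacy
from spacy.util import registry

# Look up the registered factory the same way the config machinery would
create = registry.tokenizers.get("blue_heron.BertTokenizer.v1")
tokenizer_factory = create(model="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")

nlp = spacy.blank("en")
nlp.tokenizer = tokenizer_factory(nlp)
doc = nlp("resistance to enzalutamide was reported in a patient harboring F876L mutation.")
print([t.text for t in doc])  # wordpieces, including the special [CLS]/[SEP] tokens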
And here is some test data:
{"text":"resistance to enzalutamide was reported in a patient harboring F876L mutation."}
{"text":"A retrospective study on 29 mCRPC patients progressing on abiraterone treatment reported abiraterone-resistance in 7 patients harboring AR (H874Y and T877A) mutations."}
{"text":"A retrospective study on 19 mCRPC patients progressing on enzalutamide treatment reported enzalutamide-resistance in a patient harboring AR (H874Y) mutations."}
This was the command I used to create the spaCy data:
prodigy data-to-spacy \
--ner drug_rules_entities --eval-split 0.5 --config ./spacy_drug_rules.cfg \
-F ${HOME}/Dropbox/CODE/blue_heron_ai/blue_heron_ai/blue_heron_ai.py drug_rules
Where the blue_heron_ai.py file contains the BertTokenizer definition.
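For completeness, I can also load the config by hand (after importing the module so the tokenizer factory gets registered) to check which tokenizer ends up on the nlp object. A rough sketch, with the paths and import name taken from my project layout:

from spacy.util import load_config, load_model_from_config

# Importing the module runs the @registry.tokenizers decorator (hypothetical import path)
import blue_heron_ai.blue_heron_ai  # noqa: F401

config = load_config("./spacy_drug_rules.cfg")
nlp = load_model_from_config(config, auto_fill=True)
print(type(nlp.tokenizer))  # I would expect my BertTokenizer here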