Tokenization misalignment when using ner.llm.fetch and bert.ner.manual

I'm using ner.llm.fetch to generate pre-annotated NER examples, which I then refine using bert.ner.manual with the --tokenizer-vocab ./cased_vocab.txt parameter. However, the annotations from ner.llm.fetch become misaligned in the bert.ner.manual interface due to differences in tokenization: ner.llm.fetch uses spaCy's en_core_web_sm, while bert.ner.manual applies BERT-based tokenization. If I use ner.manual instead, the alignment remains correct. Since I want to use bert.ner.manual rather than the basic ner.manual (in order to embed my texts with BERT tokenization information in preparation for training my future custom BERT-based model), is there a way to change the tokenization method in ner.llm.fetch to match my BERT tokenizer in bert.ner.manual?

Thank you so much for any of your suggestions and guidance!

Welcome to the forum @Fangjian :waving_hand:

Not sure if you've seen our docs on annotation for BERT-like transformers, but spaCy v3 takes care of aligning linguistic tokenization (produced by spaCy tokenizers) to BERT tokenization before training.

So if you don't need to annotate data that is already BERT-tokenized, you can work with the default spaCy tokenizer in your Prodigy annotation workflows (i.e. ner.llm.fetch and ner.manual). Once you're done and ready to train a transformer pipeline, you can export your data with data-to-spacy and use it for training with spaCy.
This post walks through each step in detail.
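
To make the export step a bit more concrete, here's a rough sketch of what data-to-spacy automates for you (file names are placeholders, and in practice you'd just run the built-in command): the annotations stay at the character/linguistic-token level, and spaCy's transformer layer takes care of mapping them to wordpieces during training.

# Sketch only: roughly what "prodigy data-to-spacy" does under the hood.
# File names are placeholders; in practice you'd run the built-in command instead.
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # default linguistic tokenizer, same as during annotation
db = DocBin()

with open("annotations.jsonl", encoding="utf8") as f:  # e.g. a "prodigy db-out" export
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        doc = nlp.make_doc(eg["text"])
        ents = []
        for span in eg.get("spans", []):
            # map character offsets onto linguistic tokens
            ent = doc.char_span(span["start"], span["end"], label=span["label"])
            if ent is not None:
                ents.append(ent)
        doc.ents = ents
        db.add(doc)

db.to_disk("./train.spacy")  # ready for "spacy train" with a transformer pipeline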
That is if you're planning to train a spaCy model. Let me know if that's not the case!

Thank you so much for your prompt response.

Please allow me to describe my problem more clearly. I plan to use an external transformer model, SciBERT (not the built-in spaCy model), to train my NER model. Therefore, I want to use both ner.llm.fetch and bert.ner.manual to create my training dataset.

My approach is as follows:

  1. I first use ner.llm.fetch to leverage an LLM for pre-annotation.
  2. Then, I switch to bert.ner.manual to correct the pre-annotations and ensure that my dataset is tokenized using SciBERT tokenization.

However, as I observed in the bert.ner.manual interface, the pre-annotations from ner.llm.fetch are misaligned (if I use ner.manual instead of bert.ner.manual, everything aligns perfectly). I believe this issue arises because the tokenizer vocabulary I use in bert.ner.manual is based on SciBERT, which provides scientific tokenization, whereas ner.llm.fetch relies on spaCy's default tokenization. I believe this tokenization is crucial for my future training process, as I plan to fine-tune an external SciBERT model.

I reviewed the documentation and the post you shared. My understanding is that spaCy v3 handles the alignment of linguistic tokenization, and data-to-spacy only works for spaCy’s built-in models.

I apologize for the lengthy and potentially redundant message. Could you provide guidance on resolving the tokenization misalignment issue, given that I intend to use an external model?

Hi @Fangjian ,

You're definitely right in thinking that my previous answer was valid for spaCy pipelines only! That's why I asked at the end whether you were planning to train a spaCy pipeline, because in this case, indeed, you'll have to take care of the alignment yourself.

By the way, if you're working with transformer models, I do recommend checking out the spaCy transformers extension in case it's more convenient to stay in the spaCy environment. You can train any transformer model with spaCy, and there are convenient wrappers for HF transformer models, so it's not limited to the built-in models.
In fact, there already exists a spaCy version of SciBERT in case it's of interest (I'm not sure about its full vocabulary, though).
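
Just to illustrate that last point, here's a minimal sketch (not tied to your recipe) of plugging the public allenai/scibert_scivocab_cased checkpoint into a spaCy pipeline via spacy-transformers; in a real project you'd normally set the model name in the training config instead, and the exact config handling may differ slightly between versions.

# Sketch: load an external HF checkpoint through spacy-transformers.
# Assumes spacy-transformers is installed; config keys follow its documented defaults.
import spacy

nlp = spacy.blank("en")
config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "allenai/scibert_scivocab_cased",
    }
}
nlp.add_pipe("transformer", config=config)
nlp.initialize()  # downloads/loads the HF model

doc = nlp("EGFR mutations are common in lung adenocarcinoma.")
print([t.text for t in doc])  # linguistic tokens from spaCy's own tokenizer
print(doc._.trf_data)         # wordpiece ids and alignment are handled internally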

To answer your question, though: I think the easiest way to use the ner.llm.fetch output in bert.ner.manual would be to modify the spaCy nlp object in ner.llm.fetch to use the WordPiece tokenizer (or whatever tokenizer you need for your model).

In order to use a third-party tokenizer inside a spaCy pipeline, you'd first have to wrap it in a class that implements the spaCy tokenizer API. The spaCy docs have an example of such a wrapper here; you should be able to adapt it to use the SciBERT tokenizer directly.
Then, you'd need to modify the ner.llm.fetch recipe so that it uses the tokenizer defined in the previous step. This could be as simple as adding the wrapper class to the recipe file and setting the tokenizer on the nlp object:

# ner.llm.fetch (modified)
def llm_fetch_ner(
    config_path: Path,
    source: Union[str, Iterable[dict]],
    output: Union[str, Path],
    loader: Optional[str] = None,
    resume: bool = False,
    segment: bool = False,
    component: str = "llm",
    _extra: List[str] = [],  # cfg overrides
):
    """Get bulk zero- or few-shot suggestions with the aid of a large language model

    This recipe allows you to get LLM queries upfront, which can help if you want
    multiple annotators or reduce the waiting time between API calls.
    """
    from tokenizers import BertWordPieceTokenizer
    from spacy.tokens import Doc

    class BertTokenizer:
        """Wrap the HF WordPiece tokenizer in spaCy's tokenizer API."""

        def __init__(self, vocab, vocab_file, lowercase=True):
            self.vocab = vocab
            self._tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=lowercase)

        def __call__(self, text):
            tokens = self._tokenizer.encode(text)
            words = []
            spaces = []
            for i, (word, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)):
                words.append(word)
                if i < len(tokens.tokens) - 1:
                    # If the next start != current end, we assume a space in between
                    next_start, next_end = tokens.offsets[i + 1]
                    spaces.append(next_start > end)
                else:
                    spaces.append(True)
            return Doc(self.vocab, words=words, spaces=spaces)

    log("RECIPE: Starting recipe ner.llm.fetch", locals())
    config_overrides = parse_config_overrides(list(_extra)) if _extra else {}
    config_overrides[f"components.{component}.save_io"] = True
    nlp = assemble(config_path, overrides=config_overrides)
    nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt")  # overwrite the tokenizer
    (...)

Please note that the path to the vocabulary file bert-base-uncased-vocab.txt is hardcoded here.
Now, when you run this modified ner.llm.fetch, it should produce BERT tokens and spans which are already aligned with these tokens.

You should be able to use bert.ner.manual with the file produced by the modified ner.llm.fetch directly. Since bert.ner.manual retokenizes the text, you need to make sure the settings of the tokenizer are exactly the same as in ner.llm.fetch, e.g. the lowercase setting. If you don't care about hiding the special symbols, you can even disable the add_tokens function of bert.ner.manual by commenting out line 94, because your input file will already have the tokens and spans in BERT format.
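
If you want to double-check the output before you start annotating, a quick sanity check along these lines (the file name is just a placeholder for whatever the modified recipe wrote) will tell you whether every span sits exactly on the BERT token boundaries:

# Sketch: verify that the spans in the fetched JSONL line up with the BERT tokens.
import json

with open("./fetched_annotations.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        eg = json.loads(line)
        starts = {t["start"] for t in eg.get("tokens", [])}
        ends = {t["end"] for t in eg.get("tokens", [])}
        for span in eg.get("spans", []):
            if span["start"] not in starts or span["end"] not in ends:
                print(f"Example {i}: span {span} is not aligned to token boundaries")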

Finally, the source code for the ner.llm.fetch recipe is available at your_prodigy_installation_path/recipes/llm/ner.py

You can double-check your Prodigy installation path by running the prodigy stats command.

Hi,

I sincerely appreciate your thorough guidance in helping me resolve the issue. I have successfully fixed the misalignment problem and reviewed the material you provided.

I have a follow-up question: You mentioned that spaCy has its own version of the SciBERT model, en_core_sci_scibert. I noticed that this model is trained on biomedical data and has a vocabulary of approximately 785k tokens, according to the official documentation. However, I know that the original SciBERT model is trained on a vocabulary of 30k scientific tokens, which indicates a significant difference.

I would like to use spaCy’s SciBERT model, but I want to better understand how it was trained before doing so. Specifically, do you know how spaCy trained the en_core_sci_scibert model? More precisely, what corpus was used to construct its vocabulary, and what techniques were employed to account for this vocabulary gap?

I have reviewed all the official documentation but could not find any details about the training process for en_core_sci_scibert. I considered posting this question on the spaCy forum, but I thought I’d ask here first.

I appreciate all the help! Thank you so much!

Hi @Fangjian,

Glad to hear you managed to solve the misalignment problem :raising_hands:
Unfortunately, I'm not familiar with the training details of en_core_sci_scibert. It looks like it's a fine-tuned version of SciBERT, with the data coming from OntoNotes 5, MedMentions and GENIA 1.0, so the 785k tokens would be the size of the fine-tuning corpus. You still benefit from the "big" SciBERT model via transfer learning, as it was used as the base.
I would definitely post on the spaCy and/or HF forums, and if that doesn't help, maybe even email the authors of this paper.

Thank you so much for your detailed explanation. What you said definitely makes sense and is exactly what I expected.

If I decide to use en_core_sci_scibert to train my model, along with the bert.ner.manual recipe to create my training dataset, can I still use the standard scibert-scivocab-cased vocabulary text file (which I downloaded from HuggingFace) as the argument for --tokenizer-vocab (in place of ./bert-base-uncased-vocab.txt) in the bert.ner.manual recipe? I'm asking about feasibility here because I don't know whether the scibert-scivocab-cased vocabulary file is compatible with the en_core_sci_scibert model if I want to use them together in one pipeline (the scibert-scivocab-cased vocab for creating the training dataset and en_core_sci_scibert as the model for training).

Hi @Fangjian,

The tokenizer you use for training should be the same as the one you use for annotation; otherwise you might be annotating spans that the model will never be able to learn from.
So in this case, you should be using the vocabulary of en_core_sci_scibert. (As a side note, you should make sure that the cased/uncased variants of the model and tokenizer match.)
However, if you decide to work with the spaCy pipeline both for the annotation and the training (i.e. en_core_sci_scibert), you don't really need to bother with token alignment any more, as spaCy will take care of it. In other words, if you decide to work with en_core_sci_scibert, my first answer applies: you shouldn't need bert.ner.manual and you can work with ner.manual instead, using en_core_sci_scibert as the base model for tokenization.

Also, I recommend testing your annotation and training setup end-to-end with a small sample of the dataset to make sure everything works as expected, before setting off to annotate a bigger corpus only to find out during training that there are alignment issues in the data.
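
For that end-to-end check, something as small as the following sketch (file names are placeholders, and it assumes you have the scispaCy package installed) is usually enough to confirm that your annotated spans map cleanly onto en_core_sci_scibert's tokenization:

# Sketch: check a small annotated sample against en_core_sci_scibert's tokenization.
import json
import spacy

nlp = spacy.load("en_core_sci_scibert")  # assumes the scispaCy model package is installed

with open("./sample_annotations.jsonl", encoding="utf8") as f:  # e.g. a "prodigy db-out" export
    for line in f:
        eg = json.loads(line)
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            # with alignment_mode="strict", char_span returns None if the span
            # doesn't start and end exactly on token boundaries
            if doc.char_span(span["start"], span["end"], alignment_mode="strict") is None:
                print("Misaligned span:", span, "in:", eg["text"][:80])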

I got it! That is very clear and helpful. I would like to thank you for all of your comprehensive responses and instructions along the way. I truly appreciate it!