BERT recipe when using a transformer in the pipeline?

Hi everyone,

My goal is to train a model for NER and I have a question regarding tokenization.
In my pipeline I want to use BERT, but I'm not sure if that means I have to use BERT's tokenizer during annotation.
Is the bert.ner.manual recipe from the docs only supposed to be used if I want to feed the data to BERT directly, or also if I'm using BERT as part of a spacy NER model?

There is this image in the spacy docs:

[pipeline diagram from the spaCy processing pipeline documentation]
This makes it seem like a separate tokenizer is used whether I'm using a transformer or not, so I'm not sure whether I should be annotating the data as if it were fed directly into BERT.

I'm not sure I'm making myself clear, but I hope someone can help me out.
Any hints would be much appreciated.


No, you don't have to use the BERT tokenizer for NER annotation. In the pipeline diagram above, the transformer component handles the alignment between spacy tokens and BERT tokens underneath, so you can work with only spacy tokens if you'd like. The tokenizer in that diagram is the spacy tokenizer, which is configurable and could theoretically be a wordpiece tokenizer, but typically it's the default rule-based tokenizer in spacy (currently spacy.Tokenizer.v1).
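If you're curious, you can inspect that alignment yourself. Here's a minimal sketch, assuming spacy-transformers v1.x and that the en_core_web_trf pipeline is installed (the exact layout of doc._.trf_data may differ between versions):

```python
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("Prodigy annotations use spacy tokens.")

# The transformer component stores the wordpiece output and the
# token-to-wordpiece alignment on the doc
print([t.text for t in doc])              # spacy tokens
print(doc._.trf_data.wordpieces.strings)  # wordpiece tokens, per span
print(doc._.trf_data.align.lengths)       # wordpieces overlapping each spacy token
```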

Just on the annotation side of things, most spacy token boundaries correspond to wordpiece token boundaries, so the difference between annotating with BERT wordpiece tokens and spacy tokens is very minor.
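You can see this for your own text with a fast tokenizer from the transformers library (assuming bert-base-uncased here), which returns character offsets for every wordpiece so you can check where the boundaries fall:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(
    "Tokenization differences are usually minor.",
    return_offsets_mapping=True,  # requires a fast tokenizer
    add_special_tokens=False,
)

# Wordpiece boundaries mostly match word boundaries; the "##" pieces
# show where a word was split internally
for wp, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    print(wp, start, end)
```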

If your goal is to train a spacy NER component, then it makes sense to annotate using the spacy tokenization because that corresponds best to how the model will be trained and evaluated. We train all the provided trf pipelines from data aligned to word-level tokens and it's fine. There are a small number of misalignments between the training data and the spacy tokens and also a small number of misalignments between the spacy tokenizer and the wordpiece tokens, but nearly all entity spans align without issues and the NER component is designed to ignore the few cases that are misaligned.
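If you want to check your own annotations, one quick way is to run the character offsets through Doc.char_span, which returns None when a span doesn't line up with token boundaries. A minimal sketch (annotations.jsonl is a placeholder for your file):

```python
import json
import spacy

nlp = spacy.blank("en")  # just the tokenizer, no trained model needed

with open("annotations.jsonl", encoding="utf8") as f:  # placeholder path
    for line in f:
        eg = json.loads(line)
        doc = nlp(eg["text"])
        for span in eg.get("spans", []):
            # char_span returns None if (start, end) doesn't align
            # with spacy's token boundaries
            if doc.char_span(span["start"], span["end"]) is None:
                print("Misaligned:", repr(eg["text"][span["start"]:span["end"]]))
```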


Thank you, that clears things up nicely 🙂

Hi Patty,

I am looking to use BERT as a model in the loop for an NER task, but I'm unable to figure out how to do so. Could you please provide some guidance? I have read the documentation on using custom models, but the process is still unclear to me. Thanks in advance for your time.

It really depends on what you're trying to do – for example, do you just want to align your annotations to the word piece tokenization? Do you want to initialise a model with BERT embeddings to improve accuracy? Or do you already have a trained model that you want to use to make suggestions?


Hi Ines, Thanks for your time!

What I am trying to do: I have a trained BERT model for NER that I want to use as a model in the loop (instead of a spacy model), for manual labelling, for correcting the model's predictions, and for letting the model correct its predictions in the loop, like we can with a spacy model and the ner.manual, ner.teach, ner.correct, and ner.train commands. I wrote a script for a custom model with all the functions mentioned here https://prodi.gy/docs/named-entity-recognition#custom-model, with predict and update functions written for BERT, and packed them into a custom_ner_recipe function (a simplified sketch is below). But when I try to run this recipe, it takes around 45 minutes before it fails with the error "Error while validating stream: no first example. This likely means that your stream is empty." When I set "validate": false in the config file, the command does get executed and Prodigy is hosted on localhost, but the UI only displays Loading.
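Here's roughly the shape of the recipe, where load_bert_model, predict_spans, and update_bert stand in for my actual BERT code:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("custom_ner_recipe")
def custom_ner_recipe(dataset, source):
    model = load_bert_model()              # stand-in: wraps the trained BERT NER model
    stream = JSONL(source)                 # load the input examples
    stream = predict_spans(model, stream)  # stand-in: adds "spans" to each example
    return {
        "dataset": dataset,                # dataset to save annotations to
        "stream": stream,
        "update": lambda answers: update_bert(model, answers),  # stand-in
        "view_id": "ner_manual",
    }
```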

Also, is there any timeline for when we can use Prodigy with spaCy v3? That would make my work far easier by letting me use the spacy-transformers library and load a BERT transformer as a spacy model.

FYI, the JSONL file which I am passing as input has the following structure:

{"text": "[CLS] Some text goes here[SEP]", "spans": [{"start": 6, "end": 17, "label": "LABEL1"}, {"start": 33, "end": 55, "label": "LABEL1"}, {"start": 59, "end": 119, "label": "LABEL1"}, {"start": 122, "end": 194, "label": "LABEL1"}, {"start": 207, "end": 219, "label": "LABEL1"}, {"start": 248, "end": 253, "label": "LABEL1"}, {"start": 256, "end": 257, "label": "LABEL1"}, {"start": 1227, "end": 1257, "label": "LABEL1"}, {"start": 1260, "end": 1265, "label": "LABEL1"}]}
{"text": "[CLS] Some text goes here[SEP]", "spans": [{"start": 11, "end": 37, "label": "LABEL1"}, {"start": 42, "end": 64, "label": "LABEL1"}, {"start": 97, "end": 192, "label": "LABEL1"}, {"start": 231, "end": 239, "label": "LABEL1"}, {"start": 290, "end": 326, "label": "LABEL1"}, {"start": 339, "end": 349, "label": "LABEL1"}, {"start": 350, "end": 406, "label": "LABEL1"}, {"start": 411, "end": 426, "label": "LABEL1"}, {"start": 443, "end": 456, "label": "LABEL1"}, {"start": 493, "end": 501, "label": "LABEL1"}, {"start": 505, "end": 508, "label": "LABEL1"}, {"start": 517, "end": 524, "label": "LABEL1"}, {"start": 529, "end": 532, "label": "LABEL1"}, {"start": 581, "end": 601, "label": "LABEL1"}, {"start": 604, "end": 609, "label": "LABEL1"}]}

When you set PRODIGY_LOGGING=basic, is there anything in the logs that looks relevant? If you end up with no examples in the stream, this typically means that all examples were skipped, either because they're already annotated in the dataset, or because they're invalid for some other reason (invalid JSON, no "text").
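To rule out the data itself, a quick sanity check over the input file could look something like this (input.jsonl is a placeholder for your file):

```python
import json

with open("input.jsonl", encoding="utf8") as f:  # placeholder path
    for i, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        try:
            eg = json.loads(line)
        except json.JSONDecodeError as err:
            print(f"Line {i}: invalid JSON ({err})")
            continue
        if not eg.get("text"):
            print(f"Line {i}: missing or empty 'text'")
```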

Also double-check that your stream generator doesn't get stuck in an infinite loop or similar by accident (bugs here can sometimes be pretty subtle), and if you're using PyTorch, check that PyTorch doesn't spawn multiple threads under the hood. (If it does, try moving the stream processing logic into a separate Python script and piping the JSON output forward, so you can ensure it runs in the main thread.)
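That pattern could look roughly like the sketch below, where add_model_predictions is a hypothetical stand-in for your BERT prediction logic:

```python
# predict.py: run the model in the main thread and write JSONL to stdout
import json
import sys

for line in sys.stdin:
    eg = json.loads(line)
    eg = add_model_predictions(eg)  # hypothetical: adds "spans" from BERT
    sys.stdout.write(json.dumps(eg) + "\n")
    sys.stdout.flush()
```

You can then pipe the output forward and use - as the source argument so Prodigy reads from standard input, e.g. something like python predict.py < input.jsonl | prodigy ner.manual your_dataset blank:en - --label LABEL1.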

Just a quick note: it's possible that the update callback will end up being tricky to implement with the large transformer models. The updating itself can be a bit slow (especially on CPU), and the models usually expect larger batch sizes and don't always respond well to small individual batch updates.

So it might turn out that a better workflow for transformers in the loop is to annotate ~100 examples, train, load the new model in, annotate another ~100 examples, and so on.

We already have a nightly pre-release out that you can try: ✨ Prodigy nightly: spaCy v3 support, UI for overlapping spans & more. We're hoping to have the stable release ready within the next few weeks – the main feature holding it up was improved support in spaCy v3 for binary annotations and learning from "negative examples" (see this PR).

Thanks Ines for the detailed answer.

Yes, all my examples were already annotated; these were actually the predictions made by BERT, and I wanted to use this custom recipe like ner.correct. Let me try with un-annotated data and see if the entire process works. Thanks

Ah, sorry if my post was a bit unclear – what I meant there was, annotated examples = examples that were already saved to a Prodigy dataset. If your dataset already contains an example with the same text, Prodigy will skip it if the text comes in again, so you're only asked the same question once.
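Under the hood, this works via the hashes Prodigy assigns to each example: the _input_hash is based on the input data (like the text), and the _task_hash also takes the suggested annotations into account. The "exclude_by" setting in the config controls which of the two is used for filtering. A quick sketch:

```python
from prodigy import set_hashes

eg = set_hashes({"text": "Some text goes here"})
# _input_hash: based on the input data (the text)
# _task_hash: additionally based on the annotations (spans, labels)
print(eg["_input_hash"], eg["_task_hash"])
```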