Hi @luoshengmen98,
Thanks for your question, and welcome to the Prodigy community!
It sounds like you may have misaligned (i.e., inconsistent) tokenization: your model likely used a different tokenization than your annotations, which were created with Prodigy's `bert.ner.manual`.
If that's the case, you may want to add tokenization to your input (source) file using the `"tokens"` key, and then Prodigy will use that tokenization.
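For example, a record in your source file could look roughly like this (the text, offsets and token split are just illustrative, and the record is pretty-printed here; in an actual `.jsonl` file each record sits on a single line):

```json
{
  "text": "Prodigy is made by Explosion.",
  "tokens": [
    {"text": "Prodigy", "start": 0, "end": 7, "id": 0},
    {"text": "is", "start": 8, "end": 10, "id": 1},
    {"text": "made", "start": 11, "end": 15, "id": 2},
    {"text": "by", "start": 16, "end": 18, "id": 3},
    {"text": "Explosion", "start": 19, "end": 28, "id": 4},
    {"text": ".", "start": 28, "end": 29, "id": 5}
  ]
}
```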
The docs explain this and the impact of misaligned tokenization:
> Pre-tokenizing the text for the manual interfaces allows more efficient annotation, because the selection can “snap” to the token boundaries and doesn’t require pixel-perfect highlighting. You can try it out in the live demo – even if you only select parts of a word, the word is still locked in as an entity. (Pro tip: For single-token entities, you can even double-click on the word!)
>
> Surfacing the tokenization like this also lets you spot potential problems early: if your text isn’t tokenized correctly and you’re updating your model with token-based annotations, it may never actually learn anything meaningful because it’ll never actually produce tokens consistent with the annotations.
>
> If you’re using your own model and tokenization, you can pass in data with a `"tokens"` property in Prodigy’s format instead of using spaCy to tokenize. Prodigy will respect those tokens and split up the text accordingly. If you do want to use spaCy to train your final model, you can modify the tokenization rules to match your annotations or set `skip=True` in the `add_tokens` preprocessor to just ignore the mismatches.
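If you go the `skip=True` route, here's a minimal sketch of what that looks like (it assumes a tiny in-memory stream; in practice you'd load your exported annotations from a JSONL file instead):

```python
import spacy
from prodigy.components.preprocess import add_tokens

nlp = spacy.blank("en")  # or the pipeline whose tokenization you'll train with

# Placeholder stream – use your own annotated examples instead
stream = [
    {
        "text": "Prodigy is made by Explosion.",
        "spans": [{"start": 19, "end": 28, "label": "ORG"}],
    },
]

# Per the docs above, skip=True ignores examples whose spans can't be mapped
# onto the tokenizer's tokens instead of raising a mismatch error
stream = add_tokens(nlp, stream, skip=True)

for eg in stream:
    print(eg["text"], [token["text"] for token in eg["tokens"]])
```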
How did you do your training? Can you provide code and the setup steps?
Per the docs, this shouldn't be an issue if you train with spaCy (e.g., `spacy train`):
> spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use `data-to-spacy` to export your annotations and train with spaCy v3 and a transformer-based config directly, or run `train` and provide the config via the `--config` argument.
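As a rough sketch of that workflow (the dataset name, paths and language are placeholders, and the exact flags may vary with your Prodigy/spaCy versions):

```bash
# Export your annotations to spaCy's binary format (creates train/dev .spacy files)
prodigy data-to-spacy ./corpus --ner your_ner_dataset --eval-split 0.2

# Generate a transformer-based config (-G/--gpu selects a transformer pipeline)
python -m spacy init config config.cfg --lang en --pipeline ner --gpu

# Train with spaCy v3
python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0
```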
However, if you trained outside of spaCy, you may have misaligned tokenization, which would explain the differences you're seeing.
If you want to use spaCy for training, here's a great post:
Also, for future examples, could you post them using Markdown code blocks instead of images? Images can't be searched or indexed, and all we'd need is one example. It was a bit hard to compare the two examples you shared as images. But thanks for the details!