Data format for iterative training after correcting with a BERT model

I first annotated part of my data with bert.ner.manual and trained a preliminary BERT model from it. I then used that preliminary model to run a correct step, but the data from that step did not contain BERT-specific tokens such as [CLS]. I merged the corrected data back into the first batch of annotations for continued iterative training. As a result, only a small part of my annotated data is in the [CLS]-sentence form, and the rest is plain data. Does processing the data this way have any impact?

  • [Figure: standard BERT data]
  • [Figure: the extended data after continued model iteration]

hi @luoshengmen98,

Thanks for your question and welcome to the Prodigy community :wave:

It sounds like you may have misaligned (i.e., inconsistent) tokenization: your model likely uses a different tokenization than your annotations, which were created with Prodigy's bert.ner.manual.

If that's the case, you may want to add tokenization to your input (source) file using the "tokens" key; Prodigy will then use that tokenization.
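To illustrate, here's a minimal sketch of what attaching a "tokens" key to an input example could look like. Note the whitespace tokenizer here is just a stand-in -- in your case you'd produce the token boundaries with the same tokenizer as your BERT model, so that the offsets match what the model sees:

```python
import json

def add_tokens_key(text):
    """Attach a Prodigy-style "tokens" list to a raw example.

    Whitespace splitting is only a placeholder tokenizer -- swap in
    your BERT model's tokenizer so boundaries match your model.
    """
    tokens = []
    offset = 0
    for i, tok in enumerate(text.split()):
        start = text.index(tok, offset)       # character offset of this token
        end = start + len(tok)
        tokens.append({"text": tok, "start": start, "end": end, "id": i})
        offset = end
    return {"text": text, "tokens": tokens}

example = add_tokens_key("Berlin is a city")
print(json.dumps(example))
```

Writing one such JSON object per line gives you a JSONL source file that Prodigy's manual interfaces can load directly.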

The docs explain this and the impact of misaligned tokenization:

> Pre-tokenizing the text for the manual interfaces allows more efficient annotation, because the selection can “snap” to the token boundaries and doesn’t require pixel-perfect highlighting. You can try it out in the live demo – even if you only select parts of a word, the word is still locked in as an entity. (Pro tip: For single-token entities, you can even double-click on the word!)
>
> Surfacing the tokenization like this also lets you spot potential problems early: if your text isn’t tokenized correctly and you’re updating your model with token-based annotations, it may never actually learn anything meaningful because it’ll never actually produce tokens consistent with the annotations.
>
> If you’re using your own model and tokenization, you can pass in data with a "tokens" property in Prodigy’s format instead of using spaCy to tokenize. Prodigy will respect those tokens and split up the text accordingly. If you do want to use spaCy to train your final model, you can modify the tokenization rules to match your annotations or set skip=True in the add_tokens preprocessor to just ignore the mismatches.

How did you do your training? Can you provide code and the setup steps?

Per the docs, this shouldn't be an issue if you train with spaCy (e.g., spacy train).

spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.
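For example, the two-step workflow could look like this (dataset name and paths are placeholders for your own):

```bash
# Export annotations plus a generated training config to ./corpus
prodigy data-to-spacy ./corpus --ner my_dataset

# Train with spaCy v3 using the exported config and .spacy files
python -m spacy train ./corpus/config.cfg \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy
```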

However, if you trained outside of spaCy, you may have misaligned tokenization, which would explain the differences you're seeing.

If you want to use spaCy for training, here's a great post:

Also - for future examples, can you post them as Markdown instead of images? Images can't be searched or indexed, and one example would have been enough. It was a bit hard to compare your two examples from the images, but thanks for the details!

Thank you very much for your reply! May I ask: if part of my training data is in the BERT form (with [CLS]) and the rest is labeled data without [CLS], is it okay to train a BERT model through the data-to-spacy process you described above?