Data format for iterative training after correcting with a BERT model

I first annotated part of my data with bert.ner.manual and trained a preliminary BERT model from it. I then used that preliminary model to run a correct step, but the data from that step did not contain BERT-specific tokens such as [CLS]. I merged the corrected data back into the first batch of annotations for continued iterative training. As a result, only a small part of my annotated data is in the [CLS]-sentence form, and the rest is plain data. Does processing the data this way have any impact?

  • [Figure: standard BERT data]
  • [Figure: the extended data after continued model iteration]

hi @luoshengmen98,

Thanks for your question and welcome to the Prodigy community :wave:

It sounds like you may have misaligned (i.e., inconsistent) tokenization: your model likely uses a different tokenization than your annotations, which were created with Prodigy's bert.ner.manual.

If that's the case, you may want to add tokenization to your input (source) file using the "tokens" key; Prodigy will then use that tokenization.
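To illustrate, here's a minimal sketch of what attaching a "tokens" key to an input example could look like. Note the whitespace tokenizer here is just a stand-in -- in your case you'd produce the token boundaries with the same tokenizer as your BERT model, so that the offsets match what the model sees:

```python
import json

def add_tokens_key(text):
    """Attach a Prodigy-style "tokens" list to a raw example.

    Whitespace splitting is only a placeholder tokenizer -- swap in
    your BERT model's tokenizer so boundaries match your model.
    """
    tokens = []
    offset = 0
    for i, tok in enumerate(text.split()):
        start = text.index(tok, offset)       # character offset of this token
        end = start + len(tok)
        tokens.append({"text": tok, "start": start, "end": end, "id": i})
        offset = end
    return {"text": text, "tokens": tokens}

example = add_tokens_key("Berlin is a city")
print(json.dumps(example))
```

Writing one such JSON object per line gives you a JSONL source file that Prodigy's manual interfaces can load directly.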

The docs explain this and the impact of misaligned tokenization:

> Pre-tokenizing the text for the manual interfaces allows more efficient annotation, because the selection can “snap” to the token boundaries and doesn’t require pixel-perfect highlighting. You can try it out in the live demo – even if you only select parts of a word, the word is still locked in as an entity. (Pro tip: For single-token entities, you can even double-click on the word!)
>
> Surfacing the tokenization like this also lets you spot potential problems early: if your text isn’t tokenized correctly and you’re updating your model with token-based annotations, it may never actually learn anything meaningful because it’ll never actually produce tokens consistent with the annotations.
>
> If you’re using your own model and tokenization, you can pass in data with a "tokens" property in Prodigy’s format instead of using spaCy to tokenize. Prodigy will respect those tokens and split up the text accordingly. If you do want to use spaCy to train your final model, you can modify the tokenization rules to match your annotations or set skip=True in the add_tokens preprocessor to just ignore the mismatches.

How did you do your training? Can you provide code and the setup steps?

Per the docs, this shouldn't be an issue if you train with spaCy (e.g., spacy train).

spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.
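For example, the two-step workflow could look like this (dataset name and paths are placeholders for your own):

```bash
# Export annotations plus a generated training config to ./corpus
prodigy data-to-spacy ./corpus --ner my_dataset

# Train with spaCy v3 using the exported config and .spacy files
python -m spacy train ./corpus/config.cfg \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy
```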

However, if you trained outside of spaCy, you may have misaligned tokenization, which would explain the differences you're seeing.

If you want to use spaCy for training, here's a great post:

Also - for future examples, can you post them as Markdown instead of images? Images can't be searched or indexed, and one example would have been enough. It was a bit hard to compare your two examples from the images, but thanks for the details!

Thank you very much for your reply! May I ask: if part of my training data is in the BERT form (with [CLS]) and the rest is labeled data without [CLS], is it okay to train a BERT model through the data-to-spacy process you described above?