Hi @luoshengmen98,
Thanks for your question, and welcome to the Prodigy community!
It sounds like you may have misaligned (i.e., inconsistent) tokenization: your model likely used a different tokenization than your annotations, which were created with Prodigy's `bert.ner.manual`.
If that's the case, you may want to add tokenization to your input (source) file using the `"tokens"` key, and then Prodigy will use that tokenization.
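For example, a record in your source file could look roughly like this (the text, offsets and token split are just illustrative, and the record is pretty-printed here; in an actual `.jsonl` file each record sits on a single line):

```json
{
  "text": "Prodigy is made by Explosion.",
  "tokens": [
    {"text": "Prodigy", "start": 0, "end": 7, "id": 0},
    {"text": "is", "start": 8, "end": 10, "id": 1},
    {"text": "made", "start": 11, "end": 15, "id": 2},
    {"text": "by", "start": 16, "end": 18, "id": 3},
    {"text": "Explosion", "start": 19, "end": 28, "id": 4},
    {"text": ".", "start": 28, "end": 29, "id": 5}
  ]
}
```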
The docs explain this and the impact of misaligned tokenization:
> Pre-tokenizing the text for the manual interfaces allows more efficient annotation, because the selection can “snap” to the token boundaries and doesn’t require pixel-perfect highlighting. You can try it out in the live demo – even if you only select parts of a word, the word is still locked in as an entity. (Pro tip: For single-token entities, you can even double-click on the word!)
>
> Surfacing the tokenization like this also lets you spot potential problems early: if your text isn’t tokenized correctly and you’re updating your model with token-based annotations, it may never actually learn anything meaningful because it’ll never actually produce tokens consistent with the annotations.
>
> If you’re using your own model and tokenization, you can pass in data with a `"tokens"` property in Prodigy’s format instead of using spaCy to tokenize. Prodigy will respect those tokens and split up the text accordingly. If you do want to use spaCy to train your final model, you can modify the tokenization rules to match your annotations or set `skip=True` in the `add_tokens` preprocessor to just ignore the mismatches.
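If you go the `skip=True` route, here's a minimal sketch of what that looks like (it assumes a tiny in-memory stream; in practice you'd load your exported annotations from a JSONL file instead):

```python
import spacy
from prodigy.components.preprocess import add_tokens

nlp = spacy.blank("en")  # or the pipeline whose tokenization you'll train with

# Placeholder stream – use your own annotated examples instead
stream = [
    {
        "text": "Prodigy is made by Explosion.",
        "spans": [{"start": 19, "end": 28, "label": "ORG"}],
    },
]

# Per the docs above, skip=True ignores examples whose spans can't be mapped
# onto the tokenizer's tokens instead of raising a mismatch error
stream = add_tokens(nlp, stream, skip=True)

for eg in stream:
    print(eg["text"], [token["text"] for token in eg["tokens"]])
```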
How did you do your training? Can you provide code and the setup steps?
Per the docs, this shouldn't be an issue if you train with spaCy (e.g., `spacy train`):
> spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use `data-to-spacy` to export your annotations and train with spaCy v3 and a transformer-based config directly, or run `train` and provide the config via the `--config` argument.
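As a rough sketch of that workflow (the dataset name, paths and language are placeholders, and the exact flags may vary with your Prodigy/spaCy versions):

```bash
# Export your annotations to spaCy's binary format (creates train/dev .spacy files)
prodigy data-to-spacy ./corpus --ner your_ner_dataset --eval-split 0.2

# Generate a transformer-based config (-G/--gpu selects a transformer pipeline)
python -m spacy init config config.cfg --lang en --pipeline ner --gpu

# Train with spaCy v3
python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0
```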
However, if you trained outside of spaCy, you may have misaligned tokenization, which would explain the differences you're seeing.
If you want to use spaCy for training, here's a great post:
Also, for future examples, could you post them using Markdown code blocks instead of images? Images can't be searched or indexed, and all we'd need is one example. It was a bit hard to compare the two examples you shared as images. But thanks for the details!