Your workflow sounds correct, and by default, Prodigy will take care of tokenizing the text for you. The add_tokens preprocessor will use a "tokens" property if your data already includes one; if no "tokens" are present, it will tokenize the text with spaCy and try to align the existing span annotations with the resulting tokens.
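For reference, here's roughly what that preprocessing step looks like if you run it yourself, as a minimal sketch using prodigy.components.preprocess.add_tokens (the exact token fields it adds may vary slightly between Prodigy versions):

import spacy
from prodigy.components.preprocess import add_tokens

nlp = spacy.load('en_core_web_sm')
stream = [{"text": "The order number is ID-12345"}]
# add_tokens adds a "tokens" list to each task and aligns any existing "spans" with it
stream = add_tokens(nlp, stream)
print([token["text"] for token in next(iter(stream))["tokens"]])
# ['The', 'order', 'number', 'is', 'ID-12345']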
The error you’ve encountered happens if it can’t manage to do this, because none of the tokens map exactly to the character offsets defined in your data. For example, let’s say your annotations look like this:
{"text": "The order number is ID-12345", "spans": [{"start": 23, "end": 29, "label": "ID_NUMBER"}]}
Essentially, you’ve labelled the string "12345" as an ID_NUMBER. The problem is that when you tokenize the text with spaCy, the text isn’t actually split in a way that would make "12345" its own token:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The order number is ID-12345")
print([token.text for token in doc])
# ['The', 'order', 'number', 'is', 'ID-12345']
A case like this would then cause the “mismatched tokenization” error. The reason Prodigy lets you know about this is that a) it won’t be able to render the existing annotations in ner.manual, and b) you won’t be able to easily train a model from the data out-of-the-box that performs the way you expect it to.
If you updated the default English model with the example above, it could correctly learn that a token "12345" in that context is likely to be an ID_NUMBER. However, it might never actually come across a token like this, because the tokenizer doesn’t split the text accordingly.
That’s why Prodigy tries to let you know early on if your tokenization doesn’t match the expected output. One solution would be to update spaCy’s tokenization rules to match your expected tokenization. Tokenizer rules are serialized with the model, so you can save out the nlp object using nlp.to_disk() and load that modified model with Prodigy. Alternatively, if you just want to label things and don’t care about spaCy’s tokenization, you can also provide the "tokens" property on your task that tells Prodigy how the text should be tokenized and rendered.
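For the first approach, one possible way (a sketch, assuming your IDs always follow a letters-hyphen-digits pattern) is to add a custom infix rule so the tokenizer splits on the hyphen, and then save out the modified pipeline:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')
# assumption: IDs look like "ID-12345", i.e. letters, a hyphen, then digits
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])-(?=\d)"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([token.text for token in nlp("The order number is ID-12345")])
# should now be ['The', 'order', 'number', 'is', 'ID', '-', '12345']
nlp.to_disk('/path/to/custom_tokenizer_model')  # placeholder path; pass this model to Prodigy

For the second approach, the task would carry its own "tokens" list describing how the text should be split and rendered, roughly like this (with "12345" as its own token so the span aligns cleanly):

{"text": "The order number is ID-12345",
 "spans": [{"start": 23, "end": 28, "label": "ID_NUMBER"}],
 "tokens": [
   {"text": "The", "start": 0, "end": 3, "id": 0},
   {"text": "order", "start": 4, "end": 9, "id": 1},
   {"text": "number", "start": 10, "end": 16, "id": 2},
   {"text": "is", "start": 17, "end": 19, "id": 3},
   {"text": "ID", "start": 20, "end": 22, "id": 4},
   {"text": "-", "start": 22, "end": 23, "id": 5},
   {"text": "12345", "start": 23, "end": 28, "id": 6}
 ]}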