I'm trying to update an existing model with a new entity, but the new annotated dataset I am training on apparently includes token sequences that exceed the model's limit. When training, I get the following error:
```
Token indices sequence length is longer than the specified maximum sequence length for this model (776 > 512). Running this sequence through the model will result in indexing errors.
```
What would be the correct way to limit the length to 512?
This is a warning coming internally from tokenizers, and you don't see actual errors because long sequences are truncated internally before they're passed to the model.
If it happens rarely, you can probably ignore it. If it's frequent, you may want to adjust the stride for the transformer span getter in your config. See: Receiving the warning message 'Token indices are too long' even after validating doc length is under max sequence length · Discussion #9277 · explosion/spaCy · GitHub
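As a rough sketch, the span getter settings usually live in the transformer component's config; the exact section path depends on your pipeline, and the values below are the spacy-transformers defaults, shown here only to indicate which keys to tune (smaller `window`/`stride` values produce shorter overlapping spans):

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```

Lowering `window` (and keeping `stride` below it so spans overlap) keeps each span well under the model's 512-token limit even when individual tokens expand into many wordpieces.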
Thank you for your reply and references!
I'll look into it.