Token indices sequence length is too long

Hi,

I'm trying to update an existing model with a new entity, but the annotated dataset I'm training on apparently includes token sequences that exceed the model's limit. During training I get the following error:

Token indices sequence length is longer than the specified maximum sequence length for this model (776 > 512). Running this sequence through the model will result in indexing errors.

What would be the correct way to limit the length to 512?

Thank you!

This warning comes from the underlying transformers/tokenizers libraries. You don't see actual errors because long sequences are truncated internally before they're passed to the model.
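
To see where the warning originates, here's a minimal sketch that calls a Hugging Face tokenizer directly (the model name and text are placeholders, not from this thread):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
long_text = "word " * 1000  # toy text, long enough to exceed 512 tokens

# Encoding without truncation emits the warning quoted above,
# but still returns all token ids:
ids = tokenizer.encode(long_text)
print(len(ids))  # e.g. 1002 -- longer than the model's 512-token limit

# Asking the tokenizer to truncate keeps the sequence within the limit,
# so no warning is raised:
ids = tokenizer.encode(long_text, truncation=True, max_length=512)
print(len(ids))  # 512
```

In the spaCy case you don't call the tokenizer yourself; spacy-transformers does this internally, which is why you see the warning but no failure.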

If it happens rarely, you can probably ignore it. If it's frequent, you may want to adjust the window and stride for the transformer span getter in your config. See the discussion "Receiving the warning messgae 'Token indices are too long' even after validating doc length is under max sequence length" (explosion/spaCy discussion #9277): https://github.com/explosion/spaCy/discussions/9277
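
For reference, the span getter lives in the `[components.transformer.model.get_spans]` block of the training config. A sketch using the built-in strided-spans getter; the window/stride values below are the example values from the spacy-transformers docs, not tuned for your data:

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```

With `stride` smaller than `window`, consecutive spans overlap, so each span passed to the transformer stays short while the whole doc is still covered. Shrinking `window` is the usual fix when individual spans still blow past the model's sequence limit.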

Thank you for your reply and references!
I'll look into it.