I'm trying to update an existing model with a new entity, but the new dataset of annotated data I am training on apparently contains token sequences that exceed the model's maximum length. When training, I get the following error:
Token indices sequence length is longer than the specified maximum sequence length for this model (776 > 512). Running this sequence through the model will result in indexing errors.
What would be the correct way to limit the length to 512?
This is a warning emitted internally by transformers/tokenizers, not an actual error: sequences longer than the limit are truncated internally before they are passed to the model, which is why you don't see real indexing errors during training.
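If you want to make the truncation explicit rather than rely on that internal behavior, you can tell the tokenizer to cap sequences at 512 tokens when encoding. Here is a minimal sketch assuming a Hugging Face transformers tokenizer; the checkpoint name "bert-base-uncased" and the example text are placeholders, so substitute the tokenizer that actually matches your model:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; load the tokenizer that matches your model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = "some very long annotated document ..."  # e.g. the text that produced the 776-token sequence

# truncation=True cuts the sequence down to max_length (512 here),
# so the "longer than the specified maximum sequence length" warning goes away.
encoded = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)

print(encoded["input_ids"].shape)  # at most (1, 512)
```

The same truncation and max_length arguments also work when batch-encoding a list of texts, so you can apply them wherever your training pipeline tokenizes the annotated data.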