Increase the maximum text length for NER training

Hi everyone,

I'm training a model with the prodigy train ner command and I get this error:

ValueError: [E088] Text of length 2227606 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the 'nlp.max_length' limit. The limit is in number of characters, so you can check whether your inputs are too long by checking 'len(text)'.

My question is: how can I increase this maximum length for training? Or is it better to remove the texts longer than 1,000,000 characters?

I'm not sure if someone who isn't part of the Prodigy/spaCy team is supposed to answer in this forum, but here goes...

You usually don't want such long examples for training. It's my understanding that the model only considers local context anyway, so providing all that text at once does you no good.
Even if you want the finished model to annotate longer texts, you should probably keep your training samples to a reasonable length, like sentences or short paragraphs.
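
In case it helps, here is a rough sketch of how the splitting could look. It assumes each example is a dict with a "text" string and a "spans" list of character offsets ("start", "end", "label"), and that paragraphs are separated by blank lines. The function names are made up for illustration, and any span that crosses a paragraph boundary is simply dropped, so check whether that is acceptable for your data.

```python
import re


def paragraph_bounds(text):
    """Yield (start, end) character offsets of each blank-line-separated paragraph."""
    start = 0
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > start:
            yield start, sep.start()
        start = sep.end()
    if start < len(text):
        yield start, len(text)


def split_example(example):
    """Split one {"text", "spans"} record into one record per paragraph.

    Span offsets are shifted so they are relative to the new, shorter text.
    Spans that are not fully inside a single paragraph are dropped.
    """
    text = example["text"]
    chunks = []
    for start, end in paragraph_bounds(text):
        spans = [
            {"start": s["start"] - start, "end": s["end"] - start, "label": s["label"]}
            for s in example.get("spans", [])
            if s["start"] >= start and s["end"] <= end
        ]
        chunks.append({"text": text[start:end], "spans": spans})
    return chunks


example = {
    "text": "Alice works at Acme.\n\nBob lives in Paris.",
    "spans": [
        {"start": 0, "end": 5, "label": "PERSON"},
        {"start": 22, "end": 25, "label": "PERSON"},
    ],
}
for chunk in split_example(example):
    print(chunk)
```

Only the character offsets and labels are carried over. If your spans also store token indices, those would need to be recomputed after splitting.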

This is also hinted at here: https://prodi.gy/docs/named-entity-recognition#long-text

For how to change the nlp object for training, see https://spacy.io/usage/training#custom-code

I don't think there is an option to raise this particular limit in the config file.
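
If you do decide to raise the limit in your own script instead of splitting the texts, the error message points at nlp.max_length, which is a plain attribute you can assign to. Below is a minimal sketch of that idea. The helper name raise_max_length is made up, and the SimpleNamespace object is only a stand-in so the snippet runs without spaCy installed; in practice you would pass a real nlp object. Whether prodigy train picks this up depends on how you hook in your custom code, which this sketch does not cover, and the memory warning from the error message still applies.

```python
from types import SimpleNamespace


def raise_max_length(nlp, texts):
    """Raise nlp.max_length just enough to fit the longest text; never lower it."""
    longest = max((len(text) for text in texts), default=0)
    if longest > nlp.max_length:
        nlp.max_length = longest
    return nlp.max_length


# Stand-in for a spaCy pipeline object (assumption for illustration only).
nlp = SimpleNamespace(max_length=1_000_000)
texts = ["a short text", "x" * 1_500_000]
print(raise_max_length(nlp, texts))
```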

Yes, it would be best to break your texts up into smaller units for training. For NER, we'd normally recommend paragraph-sized texts up to maybe a page or two long, like a document section. Usually context beyond the current paragraph is not useful for the NER predictions.

With smaller texts, the memory usage is a lot lower and it's easier to batch and shuffle while training, which can also improve the results.
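
To see which records need splitting before you train, something like the following could work. It is only a sketch: it assumes JSONL input with one object per line and a "text" key, and the 5000-character budget is an arbitrary stand-in for "a page or two", not an official limit.

```python
import json


def length_report(lines, limit=5000):
    """Return (line_number, length) for every record whose text is longer than limit."""
    report = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line).get("text", "")
        if len(text) > limit:
            report.append((number, len(text)))
    return report


sample = ['{"text": "short"}', "", json.dumps({"text": "x" * 12})]
print(length_report(sample, limit=10))
```

The function accepts any iterable of lines, so an open file handle for your own JSONL export can be passed in directly.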
