Upper and Lower Case

I am creating some training data. In the real world, sometimes the input strings are lower case, sometimes capitals and sometimes a mixture. The case doesn't matter to me; we just need the entities. What's the best approach when training? Should I make a copy of the input strings in lower case and upper case?

Hi @alphie,

If the case is noise in your domain, i.e. it does not provide a useful signal for NER (it usually does, though, e.g. "Apple" vs. "apple"), then the most efficient way would be to include a normalization script in your pipeline as a preprocessing step before annotation/training and in production. This way you make sure that the model is trained and tested on identically formatted data.
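A minimal sketch of such a normalization step (the function name and the choice of Unicode normalization form are illustrative, not a specific library API); the key point is that the same function runs before annotation, before training, and at inference time:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize Unicode and strip case so annotation, training,
    and production all see identically formatted input."""
    # NFC merges canonically equivalent character sequences;
    # lower() removes case as a signal.
    return unicodedata.normalize("NFC", text).lower()

# Apply the same function at every stage of the pipeline:
print(normalize_text("Apple Inc. announced..."))  # apple inc. announced...
```

Note that simple lowercasing preserves string length for ASCII text, so existing character-offset span annotations remain valid; for some Unicode characters this is not guaranteed, so normalize before annotating rather than after.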

Alternatively, if you're working with a transformer model, you could keep the annotation dataset as is and use a case-insensitive transformer variant at training/production time, e.g. bert-base-uncased. This will lowercase all text in the tokenizer, so predictions will be the same regardless of case.
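To illustrate why this works, here's a toy sketch: `predict_entities` is a hypothetical stand-in for a model whose tokenizer lowercases input (as bert-base-uncased does), so its output cannot depend on the casing of the input:

```python
def predict_entities(text: str) -> list[str]:
    """Toy stand-in for a model with an uncased tokenizer:
    all input is lowercased before the model ever sees it."""
    lowered = text.lower()  # the uncased tokenizer does this internally
    # Trivial "model": flag known entity strings if present.
    known = ["apple", "berlin"]
    return [e for e in known if e in lowered]

# Identical predictions regardless of input casing:
assert predict_entities("Apple opened in BERLIN") == predict_entities("APPLE opened in Berlin")
```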

I wouldn't duplicate the data in different cases, as this would triple the annotation effort without real gain: you'd be creating an artificially uniform case distribution that won't match real-world data.

Just for completeness: if the case is important, you should train on realistic, mixed-case data as is.

Transformer-based models generally deal better with case variance thanks to subword tokenization and contextual embeddings.

So if the case is pure noise, either add a normalization script or use a case-insensitive flavor of the transformer model.