Upper and Lower Case

I am creating some training data. In the real world, sometimes the input strings are lower case, sometimes capitals and sometimes a mixture. The case doesn't matter to me; we just need the entities. What's the best approach when training? Should I make a copy of the input strings in lower case and upper case?

Hi @alphie,

If the case is noise in your domain, i.e. it does not provide a useful signal for NER (it usually does, though, e.g. "Apple" vs. "apple"), then the most efficient way would be to include a normalization script in your pipeline as a preprocessing step before annotation/training and in production. This way you make sure that the model is trained and tested on identically formatted data.
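A minimal sketch of such a normalization step (the function name and the choice of Unicode normalization form are illustrative, not a specific library API); the key point is that the same function runs before annotation, before training, and at inference time:

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize Unicode and strip case so annotation, training,
    and production all see identically formatted input."""
    # NFC merges canonically equivalent character sequences;
    # lower() removes case as a signal.
    return unicodedata.normalize("NFC", text).lower()

# Apply the same function at every stage of the pipeline:
print(normalize_text("Apple Inc. announced..."))  # apple inc. announced...
```

Note that simple lowercasing preserves string length for ASCII text, so existing character-offset span annotations remain valid; for some Unicode characters this is not guaranteed, so normalize before annotating rather than after.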

Alternatively, if you're working with a transformer model, you could keep the annotation dataset as is and use a case-insensitive transformer variant at training/production time, e.g. bert-base-uncased. This will lowercase all text in the tokenizer, so predictions will be the same regardless of case.
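To illustrate why this works, here's a toy sketch: `predict_entities` is a hypothetical stand-in for a model whose tokenizer lowercases input (as bert-base-uncased does), so its output cannot depend on the casing of the input:

```python
def predict_entities(text: str) -> list[str]:
    """Toy stand-in for a model with an uncased tokenizer:
    all input is lowercased before the model ever sees it."""
    lowered = text.lower()  # the uncased tokenizer does this internally
    # Trivial "model": flag known entity strings if present.
    known = ["apple", "berlin"]
    return [e for e in known if e in lowered]

# Identical predictions regardless of input casing:
assert predict_entities("Apple opened in BERLIN") == predict_entities("APPLE opened in Berlin")
```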

I wouldn't duplicate the data in different cases, as this would triple the annotation effort without real gain: you'd be creating an artificially uniform case distribution that won't match real-world data.

Just for completeness: if the case is important, you should train on realistic, mixed-case data as is.

Transformer-based models generally deal better with case variance thanks to subword tokenization and contextual embeddings.

So if the case is pure noise, either add a normalization script or use a case-insensitive flavor of the transformer model.