Testing ner.batch-train model: case-sensitivity issue

Hello, I have a 50k-example dataset of designations, converted to Prodigy's spans format. I ran batch-train for 10 iterations and saved the model to Titles_Model. The command, my test code and the output are below.

python -m prodigy ner.batch-train titles_dataset en_core_web_sm --output Titles_Model --label TITLE --n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented

import spacy
nlp1 = spacy.load('Titles_Model')
doc = nlp1("My client is looking for Assistant General Manager in well established company")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

[('Assistant General Manager', 'TITLE')] -- correct
doc = nlp1("My client is looking for assistant General Manager in well established company")
[('General Manager', 'TITLE')] -- expecting assistant General Manager
doc = nlp1("looking for a Childcare Assessor SYM Tutor in Early Years in a nursery")
[('Childcare Assessor SYM Tutor in Early Years', 'TITLE')] -- correct
doc = nlp1("looking for a Childcare Assessor SYM Tutor in early years in a nursery")
[('Childcare Assessor SYM Tutor', 'TITLE')] -- expecting Childcare Assessor SYM Tutor in early years

Is there anything I am missing here? How can I do case-insensitive matching?

03 1299780.668 8574 1831 9875 0 0.824
17:06:20 - MODEL: Merging entity spans of 9947 examples
17:06:20 - MODEL: Using 9947 examples (without 'ignore')
17:07:00 - MODEL: Evaluated 9947 examples
04 1301417.082 8858 1547 9881 0 0.851
17:45:38 - MODEL: Merging entity spans of 9947 examples
17:45:39 - MODEL: Using 9947 examples (without 'ignore')
17:46:19 - MODEL: Evaluated 9947 examples
05 1292085.172 8968 1437 9887 0 0.862
18:19:14 - MODEL: Merging entity spans of 9947 examples
18:19:14 - MODEL: Using 9947 examples (without 'ignore')
18:19:58 - MODEL: Evaluated 9947 examples
06 1299365.891 9034 1371 9876 0 0.868
18:48:29 - MODEL: Merging entity spans of 9947 examples
18:48:29 - MODEL: Using 9947 examples (without 'ignore')
18:49:14 - MODEL: Evaluated 9947 examples
07 1291458.556 9044 1361 9867 0 0.869
19:18:13 - MODEL: Merging entity spans of 9947 examples
19:18:20 - MODEL: Using 9947 examples (without 'ignore')
19:19:05 - MODEL: Evaluated 9947 examples
08 1289640.442 9066 1339 9866 0 0.871
19:50:08 - MODEL: Merging entity spans of 9947 examples
19:50:08 - MODEL: Using 9947 examples (without 'ignore')
19:50:52 - MODEL: Evaluated 9947 examples
09 1298395.806 9074 1331 9903 0 0.872
20:19:49 - MODEL: Merging entity spans of 9947 examples
20:19:50 - MODEL: Using 9947 examples (without 'ignore')
20:20:34 - MODEL: Evaluated 9947 examples
10 1301209.275 9069 1336 9923 0 0.872

Correct 9074
Incorrect 1331
Baseline 0.000
Accuracy 0.872

Just to make sure I understand the question correctly: You want your model to be less case-sensitive in its predictions? What the model predicts depends on the data it was trained on – so if your data mostly contains properly capitalised examples, the trained model may struggle with all lowercase text, and vice versa. One solution for this is to use data augmentation and make sure your data contains examples of different spelling variations (capitalised, lowercase etc.).

I'd also recommend using a dedicated evaluation set instead of holding back some of the data, and using evaluation data representative of the texts you want to process. This gives you a more reliable way to test your model.
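One way to get a fixed evaluation set is to split the annotations once, save both parts, and reuse the same evaluation part for every training run. The following is only a minimal sketch in plain Python: `split_examples`, its parameters and the toy examples are hypothetical names for illustration, not part of Prodigy. It assumes the annotations are available as a list of dicts with "text" and "spans" keys.

```python
import random

def split_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, eval) lists."""
    if not 0 < eval_fraction < 1:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)
    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Toy data standing in for the real annotations.
examples = [{"text": f"example {i}", "spans": []} for i in range(10)]
train, held_out = split_examples(examples)
print(len(train), len(held_out))
```

A random split of the same data only makes the evaluation repeatable. To make it representative, the evaluation part would also need examples written the way the real input is written, for instance with lowercased titles.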

Dear Ines Montani, thank you so much for your explanation. I hadn't thought of a dedicated evaluation set. I will try that.

Does the workflow I described look all right to you, or do I need to run gold-to-spacy and train with the spacy train command to build the model?

The training workflow seems fine – you just want to train on better data and evaluate more consistently :slightly_smiling_face:

Is there any way to specify a lowercase feature when running batch-train, so that the model works for both lowercase and uppercase text?

I can use a CRF model with a lowercase feature on the same dataset.

You have full control over the data you train on, so you could just generate more examples of the same annotations, but all lowercased, partially lowercased, whatever you need. The offsets won't change, so it should be easy to generate that data programmatically.
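As a rough sketch of that idea, the function below takes one annotated example and returns it together with a fully lowercased copy and a copy where only the first letter of each span is lowercased. `case_variants` is a hypothetical helper, not a Prodigy function. It assumes each example is a dict with "text" and "spans" (character "start" and "end" offsets). It also assumes that any "_input_hash", "_task_hash" and "tokens" fields should be dropped from the copies rather than left stale.

```python
import copy

def case_variants(example):
    """Return the example plus lowercased copies with unchanged span offsets."""
    text = example["text"]
    variants = [example]
    seen = {text}

    def add(new_text):
        # Keep only new texts of the same length, so the offsets stay valid
        # (lowercasing can change the length of a few Unicode characters).
        if len(new_text) == len(text) and new_text not in seen:
            seen.add(new_text)
            variant = copy.deepcopy(example)
            variant["text"] = new_text
            # Assumption: hashes and token texts no longer match the new text.
            for key in ("_input_hash", "_task_hash", "tokens"):
                variant.pop(key, None)
            variants.append(variant)

    # Variant 1: lowercase the whole text.
    add(text.lower())
    # Variant 2: lowercase only the first character of each annotated span.
    chars = list(text)
    for span in example.get("spans", []):
        lowered = chars[span["start"]].lower()
        if len(lowered) == 1:
            chars[span["start"]] = lowered
    add("".join(chars))
    return variants

text = "looking for Assistant General Manager in a company"
start = text.index("Assistant")
end = start + len("Assistant General Manager")
example = {"text": text, "spans": [{"start": start, "end": end, "label": "TITLE"}]}
for variant in case_variants(example):
    print(variant["text"], "->", variant["text"][start:end])
```

The augmented examples would then be added to the training data alongside the originals. The evaluation set is better left as real, unaugmented text of the kind the model will actually see.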