Testing ner.batch-train model: case-sensitive issue

pooja · October 18, 2019, 11:53am

Hello, I have a dataset of 50k examples for designations, converted to Prodigy's spans format. I ran batch-train with 10 iterations and saved the model to Titles_Model. The command, a few test predictions and the training output are below.

python -m prodigy ner.batch-train titles_dataset en_core_web_sm --output Titles_Model --label TITLE --n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented

import spacy

nlp1 = spacy.load('Titles_Model')
doc = nlp1("My client is looking for Assistant General Manager in well established company")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

[('Assistant General Manager', 'TITLE')] -- correct
doc = nlp1("My client is looking for assistant General Manager in well established company")
[('General Manager', 'TITLE')] -- expecting assistant General Manager
doc = nlp1("looking for a Childcare Assessor SYM Tutor in Early Years in a nursery")
[('Childcare Assessor SYM Tutor in Early Years', 'TITLE')] -- correct
doc = nlp1("looking for a Childcare Assessor SYM Tutor in early years in a nursery")
[('Childcare Assessor SYM Tutor', 'TITLE')] -- expecting Childcare Assessor SYM Tutor in early years

Is there anything I am missing here? How can I make the matching case-insensitive?

03 1299780.668 8574 1831 9875 0 0.824
17:06:20 - MODEL: Merging entity spans of 9947 examples
17:06:20 - MODEL: Using 9947 examples (without 'ignore')
17:07:00 - MODEL: Evaluated 9947 examples
04 1301417.082 8858 1547 9881 0 0.851
17:45:38 - MODEL: Merging entity spans of 9947 examples
17:45:39 - MODEL: Using 9947 examples (without 'ignore')
17:46:19 - MODEL: Evaluated 9947 examples
05 1292085.172 8968 1437 9887 0 0.862
18:19:14 - MODEL: Merging entity spans of 9947 examples
18:19:14 - MODEL: Using 9947 examples (without 'ignore')
18:19:58 - MODEL: Evaluated 9947 examples
06 1299365.891 9034 1371 9876 0 0.868
18:48:29 - MODEL: Merging entity spans of 9947 examples
18:48:29 - MODEL: Using 9947 examples (without 'ignore')
18:49:14 - MODEL: Evaluated 9947 examples
07 1291458.556 9044 1361 9867 0 0.869
19:18:13 - MODEL: Merging entity spans of 9947 examples
19:18:20 - MODEL: Using 9947 examples (without 'ignore')
19:19:05 - MODEL: Evaluated 9947 examples
08 1289640.442 9066 1339 9866 0 0.871
19:50:08 - MODEL: Merging entity spans of 9947 examples
19:50:08 - MODEL: Using 9947 examples (without 'ignore')
19:50:52 - MODEL: Evaluated 9947 examples
09 1298395.806 9074 1331 9903 0 0.872
20:19:49 - MODEL: Merging entity spans of 9947 examples
20:19:50 - MODEL: Using 9947 examples (without 'ignore')
20:20:34 - MODEL: Evaluated 9947 examples
10 1301209.275 9069 1336 9923 0 0.872

Correct 9074
Incorrect 1331
Baseline 0.000
Accuracy 0.872

ines · October 19, 2019, 10:06am

Just to make sure I understand the question correctly: You want your model to be less case-sensitive in its predictions? What the model predicts depends on the data it was trained on – so if your data mostly contains properly capitalised examples, the trained model may struggle with all lowercase text, and vice versa. One solution for this is to use data augmentation and make sure your data contains examples of different spelling variations (capitalised, lowercase etc.).

I'd also recommend using a dedicated evaluation set instead of holding back some of the data, and using evaluation data representative of the texts you want to process. This gives you a more reliable way to test your model.
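To make the idea of a dedicated evaluation set a bit more concrete, here is a minimal sketch of scoring predictions against hand-checked examples. It assumes the evaluation examples are dicts with a "text" string and a "spans" list of character offsets; `evaluate_spans` and `toy_predict` are hypothetical names, and the exact-match scoring here is an illustration that may not match how batch-train computes its accuracy figure. With a trained spaCy model, the `predict` argument could be something like `lambda text: {(e.start_char, e.end_char, e.label_) for e in nlp(text).ents}`.

```python
def evaluate_spans(examples, predict):
    """Exact-match precision, recall and F-score over (start, end, label) spans.

    examples: list of dicts with "text" and "spans" (character offsets).
    predict: callable mapping a text to a set of (start, end, label) tuples.
    """
    tp = fp = fn = 0
    for eg in examples:
        gold = {(s["start"], s["end"], s["label"]) for s in eg["spans"]}
        pred = set(predict(eg["text"]))
        tp += len(gold & pred)   # predicted and annotated
        fp += len(pred - gold)   # predicted but not annotated
        fn += len(gold - pred)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


def toy_predict(text):
    """Stand-in for a model (hypothetical): only finds the capitalised form."""
    phrase = "General Manager"
    start = text.find(phrase)
    return set() if start == -1 else {(start, start + len(phrase), "TITLE")}


# The evaluation set deliberately includes a lowercase variant of the same title.
eval_set = [
    {"text": "hiring a General Manager now",
     "spans": [{"start": 9, "end": 24, "label": "TITLE"}]},
    {"text": "hiring a general manager now",
     "spans": [{"start": 9, "end": 24, "label": "TITLE"}]},
]
scores = evaluate_spans(eval_set, toy_predict)
print(scores)
```

Because the evaluation set contains the lowercase variant, the case-sensitive stand-in misses half of the annotated spans, which is the kind of gap a random split of mostly capitalised data can hide.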

pooja · October 19, 2019, 8:17pm

Dear Ines Montani, thank you so much for your explanation. I didn’t think of a dedicated evaluation set. I will try that.

Does the workflow I described seem all right to you, or do I need to run gold-to-spacy and train with the spacy train command to build the model?

ines · October 19, 2019, 9:09pm

The training workflow seems fine – you just want to train on better data and evaluate more consistently.

pooja · October 22, 2019, 7:14am

Is there any way to specify a lowercase feature during batch-train, so that the model works for both lowercase and uppercase text?

I can use a CRF model with a lowercase feature on the same dataset.

ines · October 22, 2019, 10:45am

You have full control over the data you train on, so you could just generate more examples of the same annotations, but all lowercased, partially lowercased, whatever you need. The offsets won't change, so it should be easy to generate that data programmatically.
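As a rough sketch of what generating those variants could look like (plain Python over exported examples, not a Prodigy API): the snippet assumes each example is a dict with a "text" string and a "spans" list of character offsets. The names `lower_spans_only`, `augment_case` and `STALE_KEYS` are made up for illustration, and dropping the "tokens" and hash keys is an assumption about what may no longer match the altered text. One caveat: `str.lower()` can change the length of a few Unicode characters, so the sketch skips any variant whose length differs from the original rather than risk misaligned offsets.

```python
import copy

# Keys that may no longer match the altered text (an assumption; adjust to your data).
STALE_KEYS = ("tokens", "_input_hash", "_task_hash")


def lower_spans_only(text, spans):
    """Lowercase only the annotated spans; return None if any offset would shift."""
    chars = list(text)
    for span in spans:
        start, end = span["start"], span["end"]
        lowered = text[start:end].lower()
        if len(lowered) != end - start:
            return None
        chars[start:end] = lowered
    return "".join(chars)


def augment_case(examples):
    """Return the original examples plus lowercased copies with the same spans."""
    augmented = []
    for eg in examples:
        augmented.append(eg)
        text = eg["text"]
        spans = eg.get("spans", [])
        seen = {text}
        for variant in (text.lower(), lower_spans_only(text, spans)):
            # Skip failed, length-changing, or duplicate variants.
            if variant is None or len(variant) != len(text) or variant in seen:
                continue
            seen.add(variant)
            new_eg = copy.deepcopy(eg)
            for key in STALE_KEYS:
                new_eg.pop(key, None)
            new_eg["text"] = variant
            augmented.append(new_eg)
    return augmented


examples = [{
    "text": "Looking for Assistant General Manager",
    "spans": [{"start": 12, "end": 37, "label": "TITLE"}],
}]
augmented = augment_case(examples)
for eg in augmented:
    span = eg["spans"][0]
    print(eg["text"], "->", eg["text"][span["start"]:span["end"]])
```

The augmented list could then be written out as JSONL and loaded into a new dataset before running batch-train again. How many lowercased copies to add relative to the originals is a judgment call that is best checked against an evaluation set containing the same kinds of casing.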
