So, just a quick update on a much better result after improving the patterns file. I decided to rebuild the CONTRACT_NUMBER model from scratch: I started out querying known contract numbers in our database, then pushed each token.shape_ into a set to figure out the unique shapes of contract numbers we’re dealing with.
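That step looked roughly like this (a minimal sketch; the contract numbers shown here are made up, the real ones came from the database query):

import spacy

nlp = spacy.load("en_core_web_lg")

# In reality these come from the database query; made-up examples here
known_contract_numbers = ["LA120215.4322", "AB1234-567", "ABC1234.12"]

shapes = set()
for number in known_contract_numbers:
    doc = nlp(number)
    # A contract number can tokenize into several tokens, so record the
    # whole shape sequence rather than just the first token's shape
    shapes.add(tuple(token.shape_ for token in doc))

# Note: token.shape_ caps runs of the same character class at four,
# which is why "LA120215" comes out as "XXdddd" rather than "XXdddddd"
for shape in sorted(shapes):
    print(shape)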
I then created the patterns file here:
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.X"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.Xdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.ddddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dddX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.dddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd.ddddXX"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXXddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddXXdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxxddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddxxdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "dddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": ".dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": ".ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": ".dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": "ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XX"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XX"},{"SHAPE": "dddd.ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XX"},{"SHAPE": "dddd.dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.dd"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.ddd"},{"SHAPE": "d"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.X"},{"ORTH": "/"},{"SHAPE": "Xd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXddddX"},{"ORTH": "/"},{"SHAPE": "Xd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"ORTH": "-"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"ORTH": "-"},{"SHAPE": "ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd"},{"ORTH": "-"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"},{"ORTH": "-"},{"SHAPE": "dd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"},{"ORTH": "-"},{"SHAPE": "ddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXXdddd"},{"ORTH": "-"},{"SHAPE": "dddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "XXdddd.X"},{"ORTH": "/"},{"SHAPE": "Xd(XXX"},{"SHAPE": "dddd"},{"ORTH": ")"}]}
I then started annotating using:
prodigy ner.teach contract_number_ner en_core_web_lg ..\data\merged.txt --loader txt --label CONTRACT_NUMBER --patterns ContractNumbers.jsonl
After several sessions, I ended up with ~3500 annotations, most of which were rejections. I presume this is normal (the majority being rejections)? It slowly starts finding actual contract numbers, then goes bad for a while, then a batch of good ones, and so on. Probably just a coincidence, related to the annotation data being used.
Every once in a while it revealed formats I didn’t cover in my initial patterns file, usually typos or OCR mistakes such as an added extra space. I decided to add those new formats to my patterns file and start another annotation session. The Matcher demo page was very useful for figuring out the correct shape of tokens.
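The same check also works in code by printing token.shape_ directly (a quick sketch; the OCR-style examples are made up):

import spacy

nlp = spacy.load("en_core_web_lg")

# Made-up examples of the kinds of OCR artifacts mentioned above
for text in ["LA120215 .4322", "LA1202- 15"]:
    doc = nlp(text)
    print([(token.text, token.shape_) for token in doc])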
I then ran the ner.batch-train command and got the following output:
(morph_ner) C:\Workspace\Dev\morph_ner\contract_number>python -m prodigy ner.batch-train contract_number_ner en_core_web_lg --output models\v1 --n-iter 15 --eval-split 0.2 --dropout 0.2 --no-missing
Loaded model en_core_web_lg
Using 20% of accept/reject examples (652) for evaluation
Using 100% of remaining examples (2608) for training
Dropout: 0.2 Batch size: 16 Iterations: 15
BEFORE 0.000
Correct 0
Incorrect 897
Entities 863
Unknown 0
# LOSS RIGHT WRONG ENTS SKIP ACCURACY
01 15.420 29 7 31 0 0.806
02 14.272 28 7 29 0 0.800
03 14.365 32 4 34 0 0.889
04 14.044 32 5 35 0 0.865
05 14.063 32 6 36 0 0.842
06 14.519 32 5 35 0 0.865
07 14.499 32 4 34 0 0.889
08 14.540 32 3 33 0 0.914
09 13.927 32 4 34 0 0.889
10 14.149 32 3 33 0 0.914
11 14.422 30 5 31 0 0.857
12 14.333 30 5 31 0 0.857
13 14.303 32 3 33 0 0.914
14 13.789 32 3 33 0 0.914
15 14.079 32 3 33 0 0.914
Correct 32
Incorrect 3
Baseline 0.000
Accuracy 0.914
Model: C:\Workspace\Dev\morph_ner\contract_number\models\v1
Training data: C:\Workspace\Dev\morph_ner\contract_number\models\v1\training.jsonl
Evaluation data: C:\Workspace\Dev\morph_ner\contract_number\models\v1\evaluation.jsonl
(morph_ner) C:\Workspace\Dev\morph_ner\contract_number>
I’m very happy with those stats!
One thing I notice now is that all my shapes have the leading characters uppercase (XX, not xx). This means that when testing afterwards, it doesn’t recognize la120215.4322, whereas LA120215.4322 is recognized just fine. That makes sense, since I probably didn’t do a single annotation with a lowercase la.
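If it’s purely a casing issue, one option might be to also add the lowercase shape variants to the patterns file, e.g. (only two shown; the rest of the file would presumably need the same treatment):

{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "xxdddd"}]}
{"label":"CONTRACT_NUMBER","pattern":[{"SHAPE": "xxdddd.dddd"}]}

That only helps the pattern matcher surface lowercase candidates during annotation, though; the model itself would still need to see some accepted lowercase examples to learn them.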
I’m not sure if I should create a blank model in spaCy, save it, and then run ner.batch-train with that instead of using en_core_web_lg?
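For what it’s worth, creating and saving a blank model would look roughly like this (a minimal sketch, assuming spaCy v2 to match the Prodigy version above; the output path is arbitrary):

import spacy

# Blank English pipeline instead of the pretrained en_core_web_lg
nlp = spacy.blank("en")

# Add an empty NER pipe so the model can be trained for entities
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

nlp.to_disk("models/blank_en")

ner.batch-train could then presumably be pointed at models/blank_en instead of en_core_web_lg.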
So far, very positive results! The initial learning curve for me is definitely figuring out how to get Prodigy to find the right stuff in the annotation source, so that it can present relevant tasks.