Hello,
So I’m trying to create an NER model to extract addresses from freeform text. I’ve created a patterns .jsonl file, which consists of ~60000 distinct patterns that an address could take. For instance, an address like “112 18th St” would be defined by:
{"label": "ADDR", "pattern": [{"shape": "ddd"}, {"shape": "ddXX"}, {"lower": "st"}]}
I then ran the following command:
prodigy ner.match addr_annotations en_core_web_lg data/source.jsonl --patterns data/address_seeds.jsonl
This mindlessly yields matches to the patterns file, but doesn’t seem to focus on the uncertain ones. Or at least I don’t think it does, because it keeps asking about lots of duplicates in similar surrounding text, even after 1500 annotations. The other shortcoming is that if a string deviates even slightly from a defined pattern, the model makes no novel suggestions. So I decided to give ner.teach a try to avoid these problems, like so:
prodigy ner.teach addr_annotations en_core_web_lg data/source.jsonl --label ADDR --patterns data/address_seeds.jsonl
But to my surprise, the suggested entities seemed to ignore the patterns file entirely and were just random words. In particular, even though the patterns file defines multi-token entities, every suggestion from ner.teach was a single token. There can’t be anything wrong with the patterns file itself, since the addresses were correctly suggested by ner.match.
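To rule out a problem with the patterns file, I also loaded it straight into spaCy’s Matcher, and the full multi-token span matches fine. A minimal version of that check (using the spaCy v3 matcher.add(name, [pattern]) signature; on v2 it would be matcher.add(name, None, pattern), and the test sentence is just something I made up):

```python
import json
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)

with open("data/address_seeds.jsonl") as f:
    for i, line in enumerate(f):
        entry = json.loads(line)
        # Uppercase the attribute names ({"shape": ...} -> {"SHAPE": ...})
        # to follow the convention used in the Matcher docs.
        pattern = [{k.upper(): v for k, v in tok.items()} for tok in entry["pattern"]]
        matcher.add(f"ADDR_{i}", [pattern])

doc = nlp("Please ship the package to 112 18th St by Friday.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # prints the full "112 18th St" span
```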
ner.match output: [screenshot]
ner.teach output with the same model, dataset, and patterns file: [screenshot]
Is this the expected behaviour of ner.teach? And if not, could anyone enlighten me as to the cause?