I've been working on training new entities for food items.
My model is performing fairly well, except when the entity appears as the final token of the input. If I append any token to the input example, I get the right result. Here's the same example phrase copied from the ner.print-stream output, first without and then with a trailing token:
Hello! Can I please place a take away order for 4 QUANTITY special PASTA meatballs SAUCE , 5 QUANTITY marinara SAUCE gnocchi

vs

Hello! Can I please place a take away order for 4 QUANTITY special PASTA meatballs SAUCE , 5 QUANTITY marinara SAUCE gnocchi PASTA .
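The same comparison can be reproduced outside of the recipe by loading the trained model directly in spaCy, something like the sketch below (the model path is just a placeholder for wherever the batch-trained model was saved):

import spacy

# Placeholder path for the batch-trained model
nlp = spacy.load('models/ner_food')

text = "Hello! Can I please place a take away order for 4 special meatballs, 5 marinara gnocchi"

for variant in (text, text + " ."):
    doc = nlp(variant)
    # The final "gnocchi" only gets a PASTA label in the second variant
    print([(ent.text, ent.label_) for ent in doc.ents])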
At first I thought it was a problem with my training data. I had ~900 example phrases: 500 of them generated using ner.make-gold and the rest from additional ner.teach sessions. The data included several examples of the form "Can I order {items}?". Thinking the model might have overfit that form, I added an equal number of make-gold examples without the trailing ?, but that still led to the same result.
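Building the variant source for make-gold only took something along these lines (the file names here are just placeholders):

import json

# Strip the trailing "?" from the original phrases and write a new
# JSONL source to annotate with ner.make-gold (file names are placeholders)
with open('phrases.jsonl') as f_in, open('phrases_no_question.jsonl', 'w') as f_out:
    for line in f_in:
        text = json.loads(line)['text'].rstrip()
        if text.endswith('?'):
            text = text[:-1].rstrip()
        f_out.write(json.dumps({'text': text}) + '\n')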
The next thing I tried was completely pruning the training examples that had trailing punctuation and retraining from scratch. Still the same result. Here's the pruning script and the retrain command:
from prodigy.components.db import connect

# Connect to Prodigy's database and load the original gold dataset
DB = connect()
orig_dataset = DB.get_dataset('ner_gold')

# Keep only the examples that don't end in punctuation
pruned_dataset = []
for entry in orig_dataset:
    if entry['text'].endswith(('?', '.')):
        continue
    pruned_dataset.append(entry)

# Save the pruned examples to a new dataset
DB.add_dataset('ner_gold_no_punct', {'description': '', 'author': ''})
DB.add_examples(pruned_dataset, ['ner_gold_no_punct'])
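A quick sanity check (just reusing the same DB calls as above) shows what's left after pruning:

# How many examples survived, and did the new dataset get written?
print(len(pruned_dataset), 'of', len(orig_dataset), 'examples kept')
print(len(DB.get_dataset('ner_gold_no_punct')), 'examples in the new dataset')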
prodigy ner.batch-train ner_gold_no_punct en_core_web_lg -o models/ner_no_punct -l PASTA,SAUCE,...
Is there something I'm missing? Have others run into anything similar?