Thanks @mattdr for the details! That's helpful.
Nice performance and great work scaling up your annotations!
I still can't find anything obvious.
Could you try running your ner.correct command, but saving your annotations to a different dataset than dataset_cw?
For example:
prodigy ner.correct dataset_cw_gold models/model-last final_dataset.jsonl --label MODULE,LOGISTICS,PRODUCT,HR,POLICY
I named it "gold" because ner.correct used to be called ner.make-gold (see here), and its output is sometimes thought of as "gold standard" annotations.
Also, can you try running print-stream on a sample of your final_dataset.jsonl (e.g., a new file containing just the first 10 records of that file)?
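If it's easier, on macOS/Linux you could create that sample file with something like this (just a sketch; the output file name matches the command below):
head -n 10 final_dataset.jsonl > final_dataset_first10.jsonl
Then run: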
prodigy print-stream models/model-last final_dataset_first10.jsonl
This recipe will make predictions on your data. Just a heads up: it scores all of the source records, which is why I recommended a smaller file, but you can try it on more records too.
To check that it's working, you can even replace models/model-last with a pretrained ner pipeline like en_core_web_sm and confirm the output looks like what you'd expect.
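For example (same command as above, just swapping in the pretrained pipeline):
prodigy print-stream en_core_web_sm final_dataset_first10.jsonl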
Also -- cool trick: you can use your model to score previously annotated data, like this:
prodigy print-stream models/model-last dataset:dataset_cw
or just the part of the data you've accepted, rejected, or ignored. For example, you can score any annotations you've ignored by running:
prodigy print-stream models/model-last dataset:dataset_cw:ignore
Alternatively, you can try your model models/model-last in spacy-streamlit (GitHub - explosion/spacy-streamlit: 👑 spaCy building blocks and visualizers for Streamlit apps).
This gives a great interface for showing the model to users/non-data scientists, but it can also help make sure the model is predicting as you expect.
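If you haven't used it before, it should just be a pip install (a sketch; see the repo README for current instructions and a minimal example app you can launch with streamlit run):
pip install spacy-streamlit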
If you find examples where the model predicts entities correctly outside of Prodigy, but they still don't show up in Prodigy, let us know.