Hi all,
We're having great success with Prodigy, and very much appreciating the work you've all done.
Now, we have an NER model that has been trained a bit (5k annotations) with `ner.teach`, and we are out to improve it further.
Right now, the plan I'm working out is as follows:
A spaCy pipeline, including our trained-with-Prodigy NER model, makes entity predictions over our entire dataset (50MM rows of short paragraphs of text) on our GCP instance and writes the results into a BigQuery db. The pipeline has an entity normalization component that takes advantage of our existing dictionary of known entities and their variations (this is also where patterns are exported for the `EntityRuler` component) to match entity variations to the canonical name (e.g. `PERSON Kim K` --> `PERSON Kim Kardashian`). If the entity normalization component is unable to map an entity via our dictionary, it emits the normalized entity as `N/A`. All rows with `N/A` in the normalized entity field are therefore new entities we have no knowledge of (or variations/misspellings of entities we already know about but don't have in our dictionary), and so should be reviewed to confirm the prediction is accurate ("accept", yes it's a new person) and potentially added to our dictionary.
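For concreteness, a simplified sketch of the normalization component looks roughly like this (spaCy v2-style; the dictionary, extension name, and model path are illustrative placeholders):

```python
import spacy
from spacy.tokens import Span

# Illustrative lookup: lowercased variation -> canonical name
KNOWN_ENTITIES = {"kim k": "Kim Kardashian"}

# Custom attribute to hold the normalized name
Span.set_extension("canonical", default="N/A")

def normalize_entities(doc):
    """Attach the canonical name to each predicted entity, or 'N/A' if unknown."""
    for ent in doc.ents:
        ent._.canonical = KNOWN_ENTITIES.get(ent.text.lower(), "N/A")
    return doc

nlp = spacy.load("/path/to/our_prodigy_trained_model")  # placeholder path
nlp.add_pipe(normalize_entities, after="ner")
```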
So, next, I'm planning to query all the rows where `N/A` is in the normalized entity field, and then serve these text examples to annotators on a hosted Prodigy instance. The resulting annotations will then be exported to augment our existing dictionary of known entities (although I have no idea at the moment how to efficiently add new PERSON entities with their canonical names to our dictionary), so that the next time the pipeline runs, the `EntityRuler` can pull in these entities as known patterns.
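The closest I've come so far is turning the accepted annotations into `EntityRuler` patterns; mapping each new surface form to a canonical name would still be a manual step. A rough sketch (dataset and file names are made up):

```python
import json

# Assumes annotations were exported with something like:
#   prodigy db-out ner_new_people > annotations.jsonl
patterns = []
with open("annotations.jsonl") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue  # only keep examples the annotators accepted
        for span in eg.get("spans", []):
            surface = eg["text"][span["start"]:span["end"]]
            patterns.append({"label": span["label"], "pattern": surface})

# The EntityRuler reads patterns from JSONL, one pattern per line
with open("patterns.jsonl", "w") as f:
    for pattern in patterns:
        f.write(json.dumps(pattern) + "\n")
```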
Then, I will `batch-train` the model with these new annotations, and run the NLP pipeline over the entire dataset again with the new model (with the `EntityRuler` knowing about the matches from our annotators as direct matches).
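In other words, after retraining, reload the new model and plug the patterns in ahead of the statistical NER (again spaCy v2-style, paths and dataset names illustrative):

```python
import spacy
from spacy.pipeline import EntityRuler

# Load the retrained model, e.g. the output directory of something like:
#   prodigy ner.batch-train ner_new_people en_core_web_lg --output ./new_model
nlp = spacy.load("./new_model")  # placeholder path

# Feed the annotator-confirmed entities in as direct matches, before the
# statistical NER runs ("patterns.jsonl" from the sketch above)
ruler = EntityRuler(nlp).from_disk("patterns.jsonl")
nlp.add_pipe(ruler, before="ner")
```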
All of that said, I believe I'm not leveraging the power of Prodigy correctly in the plan above, and I'm wondering where and how in my proposal I should be using `ner.make-gold` and/or `ner.silver-to-gold` to achieve similar results. Two things to keep in mind: there are multiple labels we'd like to be predicting (e.g. with the same model, we'd like annotation tasks going for `PERSON` and also for `COMPANY`), and we actively use this data in our business, so it would be great to be iteratively improving the database in a loop without all of the munging and manual steps I described above (especially since it takes a few days for the NLP pipeline to run over our complete dataset)!
Thanks for any replies, and also again for such a great NLP tool.
~TW