ner.teach to silver to gold -- how to best leverage Prodigy's recipes

Hi all,

We're having great success with Prodigy, and very much appreciating the work you've all done.

Now we have an NER model that has been trained a bit (5k annotations) with ner.teach, and we're looking to improve it further.

Right now, the plan I'm working out is as follows:

A spaCy pipeline, including our Prodigy-trained NER model, makes entity predictions over our entire dataset (50MM rows of short paragraphs of text) on our GCP instance, writing the results into a BigQuery db. The pipeline has an entity normalization component that takes advantage of our existing dictionary of known entities and their variations (this is also where patterns are exported for the EntityRuler component) to match entity variations to the canonical name (e.g. PERSON Kim K --> PERSON Kim Kardashian).

If the entity normalization component is unable to map an entity via our dictionary, it outputs the normalized entity as N/A. All rows with N/A in the normalized entity field are therefore new entities we have no knowledge of (or a variation/misspelling of an entity we already know about but don't have in our dictionary), so each one should be reviewed ("accept": yes, it's really a new person, or reject) and, if accepted, potentially added to our dictionary.
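For concreteness, here's a minimal sketch of what I mean by the normalization component, assuming spaCy v2's pipeline component API (the model name and dictionary entries here are made up):

```python
import spacy
from spacy.tokens import Span

# Hypothetical normalization dictionary: entity variation -> canonical name
NORM_DICT = {
    "kim k": "Kim Kardashian",
    "kim kardashian": "Kim Kardashian",
}

# Custom attribute to hold the normalized entity name
Span.set_extension("canonical", default="N/A")

class EntityNormalizer:
    """Map each predicted entity to its canonical name, or N/A if unknown."""

    def __init__(self, norm_dict):
        self.norm_dict = norm_dict

    def __call__(self, doc):
        for ent in doc.ents:
            ent._.canonical = self.norm_dict.get(ent.text.lower(), "N/A")
        return doc

nlp = spacy.load("our_prodigy_trained_model")  # hypothetical model name
nlp.add_pipe(EntityNormalizer(NORM_DICT), after="ner")

doc = nlp("Kim K posted again.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.canonical)
```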

So, next, I'm planning to query all the rows where the normalized entity is N/A, and then serve these text examples to annotators on a hosted Prodigy instance. The resulting annotations will then be exported to augment our existing dictionary of known entities (although I have no idea at the moment how to efficiently add new PERSON entities with their canonical names to our dictionary), so that when the pipeline runs again, the EntityRuler can pull these entities in as known patterns.
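In case it clarifies the plan, I'm imagining something like this to turn the N/A rows into a Prodigy task file (the file name and query-result shape are hypothetical):

```python
import json

# Hypothetical: rows pulled from BigQuery where the normalized entity is N/A
rows = [
    {"text": "Kim K posted again.", "row_id": 123},
]

with open("na_entities.jsonl", "w", encoding="utf8") as f:
    for row in rows:
        task = {"text": row["text"], "meta": {"row_id": row["row_id"]}}
        f.write(json.dumps(task) + "\n")
```

That file could then be served to annotators with something like `prodigy ner.teach person_na our_model na_entities.jsonl --label PERSON`.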

Then I'll batch-train the model with these new annotations and run the NLP pipeline over the entire dataset again with the new model (with the EntityRuler now knowing about the matches from our annotators as direct matches).
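Feeding the annotators' confirmed matches back in would look roughly like this, if I understand spaCy v2's EntityRuler correctly (pattern contents are made up):

```python
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("our_prodigy_trained_model")  # hypothetical model name

# Patterns exported from the dictionary after the annotation round
patterns = [
    {"label": "PERSON", "pattern": "Kim K"},
    {"label": "COMPANY", "pattern": "Kardashian Inc"},
]

ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)

# Run the ruler before the statistical NER so it respects these spans
nlp.add_pipe(ruler, before="ner")
```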

All of that said, I believe I'm not leveraging the power of Prodigy correctly in my plan above, and I'm wondering where and how I should be using ner.make-gold and/or ner.silver-to-gold to achieve similar results. Two constraints: there are multiple labels we'd like to predict (e.g. with the same model, we'd like annotation tasks for both PERSON and COMPANY), and we actively use this data in our business, so it would be great to be iteratively improving the database in a loop, without all of the munging and manual steps I described above (especially since it takes a few days for the NLP pipeline to run over our complete dataset)!

Thanks for any replies, and also again for such a great NLP tool.

~TW

I think the workflow you've set up honestly doesn't sound bad, so I'd be reluctant to give you (inherently speculative) advice that would ask you to do a bunch of work to replace something that's already giving you results.

It's true that there can be advantages to having the representation improve "in the loop". However, this helps the most on smaller tasks where starting and stopping the server all the time would get in the way of your workflow. If you're working live on a small problem, it matters a lot that the state changes over a matter of minutes. That small-scale state change really needs to happen within the process. But if your state is changing over the course of hours or days, having a background process compute the state change is very helpful. It's easier to reason about and debug, and the state isn't tied into the running process --- so you can scale the Prodigy service, use pre-emptible instances, etc.

The part that I would work on is this:

(especially since it takes a few days for the NLP pipeline to run over our complete dataset)!

There's no good reason why that should be the case if you're using GCP. Parsing text is an embarrassingly parallel workload, so if you need to do 60 hours of parsing, you should be launching 60 VMs to complete it in one hour.

There's a lot of bad advice about scaling data processing workloads that massively over-complicates the problem. The simple answer is: you probably don't want to use something like Spark, and you probably don't need something like Kubernetes either (although it might be easier if you already know it). Just cut your data up into chunks, and have a script that can read in a chunk, process it, and write out the results. Make the number of chunks equal to the number of processes you intend to launch, and let it rip.
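As a rough sketch of the chunk worker (model and path names are placeholders):

```python
import sys
import spacy

def process_chunk(in_path, out_path):
    """Read one chunk of texts, run the pipeline, and write out the entities."""
    nlp = spacy.load("our_prodigy_trained_model")  # hypothetical model name
    with open(in_path, encoding="utf8") as f_in, \
         open(out_path, "w", encoding="utf8") as f_out:
        texts = (line.strip() for line in f_in)
        for doc in nlp.pipe(texts, batch_size=1000):
            ents = [(ent.text, ent.label_) for ent in doc.ents]
            f_out.write(str(ents) + "\n")

if __name__ == "__main__":
    process_chunk(sys.argv[1], sys.argv[2])
```

Launching a worker per chunk is then just `python process_chunk.py chunks/03.txt results/03.txt`, repeated across however many processes you've decided on.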

If you want to avoid investing in the devops side of this, the easy way is to just use one machine with lots of CPUs. I would advise setting up a VM image using Packer, so that you can launch an instance group of workers. Another trick I use is to mount the storage buckets using GCSFuse. This way you can use storage buckets as a simple and cheap shared file system to read and write the results.
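For reference, the basic GCSFuse usage is just `gcsfuse my-bucket /mnt/bucket` (names here hypothetical); once it's mounted, the worker script above can read and write ordinary file paths under the mount point.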


Thanks for your replies above, and especially for reminding me that I could simply create a giant machine and spin up multiple instances of spaCy on it for the long-running text processing task! I got into the weeds over the weekend trying to programmatically launch VMs using startup scripts via gcloud commands, and it ultimately proved fruitless.

I wouldn't say I've done a bunch of work yet (insofar as I haven't coded the entire pipeline); I'm just thinking through the problems at this point!

After reading your replies, and my lengthy original question, I've realized that my specific questions are:

  1. What is the best way to use Prodigy to create new entity normalization mappings? E.g., when the EntityRuler has predicted that a given entity is a COMPANY but we don't have a norm mapping for it, is there an intended workflow for annotating (A) yes, this is in fact a COMPANY, and if so, (B) the normalized entity is X? (I've sketched one possible shape for this below.)

  2. Am I ignorantly skipping over some of the power of ner.make-gold and ner.silver-to-gold in my system as proposed above?
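For question 1, here's roughly the kind of interface I'm imagining, assuming a Prodigy version with the blocks UI (the recipe name and fields are just my guess at one possible shape):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("norm-entities")
def norm_entities(dataset, source):
    # Tasks with pre-highlighted entity spans (e.g. the EntityRuler's predictions)
    stream = JSONL(source)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {
            "blocks": [
                # (A) accept/reject the predicted span and label
                {"view_id": "ner"},
                # (B) free-text field for the canonical name
                {"view_id": "text_input",
                 "field_id": "canonical",
                 "field_label": "Canonical name"},
            ]
        },
    }
```

Accepted answers would then carry both the confirmed label and the typed canonical name, which we could export back into our dictionary.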