Add items to an existing NER entity and update an existing trained dataset

Hi,

Love what you're doing with Prodigy / spaCy. I have a few questions when you have some time.

Q1: In the cookbooks, I see how you can train a new entity (which is awesome). What if I want to add more items to an existing entity? Can I use the same process and just specify the existing NER entity?

Q2: If I have a pre-existing dataset that has been trained/curated by a user and now I want to add more data for the user to label, what is the proper/suggested way to do this? Can I simply append items to the original .jsonl file, or add rows to a DB table?

Q3: How are folks scaling, deploying, and maintaining their Prodigy/spaCy models? Is it using the spaCy microservices and Docker containers (like here: https://github.com/jgontrum/spacy-api-docker)?

Thanks!

Thanks a lot!

Sure – you mean improving an existing category that the model already knows, right? In this case, you can just start the Prodigy server without patterns and correct the model's predictions for that category. For example:

prodigy ner.teach your_dataset en_core_web_sm your_data.jsonl --label PERSON

You can also add manual annotations to your dataset at any time and update the existing model with those. The ner.make-gold recipe is useful here, too, if you want to get a feeling for the model's existing predictions and correct them by hand.
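For example, to see what the model currently predicts for your texts and fix its mistakes (same placeholder names as above):

prodigy ner.make-gold your_dataset en_core_web_sm your_data.jsonl --label PERSON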

In Prodigy, a "dataset" is usually an annotated dataset. So to avoid confusion (also for other users who come across this thread later), I'll be calling the input data the "stream". When you start the Prodigy server, Prodigy will load your data and request a batch of questions from it. As the user annotates the questions, the answers are saved in the database. You can tell Prodigy the dataset name when you start the server, so you can always add more annotations to the same set in the future.
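If you ever want to inspect or extend a dataset programmatically, you can also talk to Prodigy's database from Python. A minimal sketch, assuming your dataset is called "your_dataset":

from prodigy.components.db import connect

db = connect()  # connects using the database settings from your prodigy.json
examples = db.get_dataset("your_dataset")  # all annotations saved to this set so far
print(len(examples))

# new examples can also be added to the same set programmatically:
# db.add_examples(new_examples, datasets=["your_dataset"])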

Under the hood, a stream is a Python generator – so you can populate it however you like, add more examples to it later, or simply re-start the Prodigy server with a different file. All of this is up to you. In general, we always recommend loading in a lot of data at once, especially if you want to use an active learning-powered recipe to help you select the examples to annotate. You always want to have enough data to choose from – and because the stream isn't processed all at once, you don't have to worry as much about memory usage and startup time.
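Since a stream is just a generator, chaining several sources is trivial. Here's a minimal sketch using Prodigy's JSONL loader, with placeholder file names:

from prodigy.components.loaders import JSONL

def stream():
    # serve the original examples first...
    yield from JSONL("your_data.jsonl")
    # ...then any examples you added later in a separate file
    yield from JSONL("more_data.jsonl")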

For spaCy and spaCy models, Docker and the library you linked are definitely popular – there are also several other plugins and solutions (see here). Prodigy itself is a developer tool, so it's mostly run on the developer's local machine. Right now, its main focus is to let you iterate on your data quickly, try out different things and run training experiments. It's a regular Python library, though, so you're pretty flexible in terms of how you want to deploy and run it.
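For instance, since it's all just Python, you can start the annotation server from a script instead of the command line – roughly like this (a sketch, using the same placeholder names as above):

import prodigy

# equivalent to running the ner.teach command from the terminal
prodigy.serve("ner.teach your_dataset en_core_web_sm your_data.jsonl --label PERSON", port=8080)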

We are currently working on an extension for Prodigy, the annotation manager, which will let you scale up annotation projects, manage multiple annotators and create larger corpora and datasets. The tool is currently under active development and we're hoping to announce the private beta for our users soon.

Thanks!