So is there a way I can run ner.manual to manually annotate data and also use ner.teach where possible, concurrently?
Hi! I'm not 100% sure what you mean by "concurrently", so here are some possible use cases and solutions:
1. Running two recipes at the same time, e.g. for different annotators.
If you want to run both recipes at the same time and have different annotators work on them, that's easy: you can just start two separate processes. As of v1.4.0, the PRODIGY_HOST and PRODIGY_PORT environment variables let you override the host and port used to serve the web app and REST API, so you can easily set different ports on the command line:
PRODIGY_PORT=1234 prodigy ner.manual your_dataset en_core_web_sm your_data.jsonl --label PERSON
PRODIGY_PORT=5678 prodigy ner.teach other_dataset en_core_web_sm your_data.jsonl
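If you also need to change the host, e.g. to make the app available to other machines on your network, you can set both variables at once (the values here are just examples):
PRODIGY_HOST=0.0.0.0 PRODIGY_PORT=1234 prodigy ner.manual your_dataset en_core_web_sm your_data.jsonl --label PERSON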
In theory, you could also add the annotations to the same dataset. However, I wouldn't recommend that, because the two recipes produce different kinds of training data, and ideally you'd want to run your experiments separately. The annotations produced by ner.manual are great for evaluation data or for correcting very specific edge cases, whereas the data you collect with ner.teach is the best selection of examples to improve an existing model.
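For example, you could later use the ner.manual dataset as a dedicated evaluation set when training on the ner.teach annotations. Here's a sketch, assuming your Prodigy version supports the --eval-id argument on ner.batch-train (the dataset names are made up):
prodigy ner.batch-train teach_dataset en_core_web_sm --output /tmp/model --eval-id manual_dataset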
2. Using a hybrid of ner.manual and ner.teach.
If you're looking for a recipe that uses an existing model's predictions and lets you correct them manually, you might want to try the ner.make-gold recipe instead. It will stream in the texts and highlight the model's predictions for a given label, which you can then accept or reject, or correct accordingly:
prodigy ner.make-gold your_dataset en_core_web_sm your_data.jsonl --label PERSON
Note that unlike ner.teach, ner.make-gold won't update a model in the loop. The reason is that with manual annotation only, you'll need a lot of examples to get meaningful results, especially if you're training a new category, so the model in the loop wouldn't be able to learn quickly enough for the active learning to really make a difference. Which also leads to the next point...
3. Collecting "new" annotations with a model in the loop.
Fully manual annotation is very tedious, and we actually think it's something you should only have to resort to for very difficult edge cases and for gold-standard evaluation data. If you're looking to bootstrap a new entity type from scratch, Prodigy offers other, more efficient ways to help you do that, for example, supplying a list of --patterns with explicit or abstract examples of the entities you're looking for:
prodigy ner.teach your_dataset en_core_web_sm your_data.jsonl --patterns patterns.jsonl --label DOG
Each pattern consists of a list of token descriptions, similar to the patterns used by spaCy's Matcher. The patterns will be used to find initial examples in your data, until the model has seen enough to start making its own suggestions in the loop.
{"label": "DOG", "pattern": [{"lower": "poodle"}]}
{"label": "DOG", "pattern": [{"lower": "golden"}, {"lower": "retriever"}]}
Here's a video tutorial we've recorded that shows the whole end-to-end workflow (see here for a TL;DR summary). We start by collecting examples of the new entity type DRUG using word vectors, convert those examples to match patterns and then use those patterns to suggest more examples in the data during ner.teach, until the model has learned enough to start making its own suggestions.
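If you want to reproduce that workflow, the first steps look roughly like this. This is only a sketch: the seed terms, dataset and file names are made up, and the exact arguments may differ slightly between Prodigy versions:
prodigy terms.teach drug_terms en_core_web_lg --seeds "dopamine, fentanyl, ibuprofen"
prodigy terms.to-patterns drug_terms drug_patterns.jsonl --label DRUG
prodigy ner.teach drug_dataset en_core_web_sm your_data.jsonl --patterns drug_patterns.jsonl --label DRUG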
So my use case is similar to what you describe in method 2, but unfortunately it doesn't use active learning. Is it possible to create annotations from ner.manual and ner.teach for separate kinds of entities, for example using ner.manual for toy names and ner.teach for extracting commencement dates, and then combine those annotations to train my model on?
Sure, that's no problem! Prodigy datasets can be exported as simple JSONL files, so you can merge them into a single dataset later on, run separate experiments training on the individual datasets or use the pre-trained model as the input model for the next training session:
prodigy ner.batch-train toys_dataset en_core_web_sm --output /output-model
prodigy ner.batch-train dates_dataset /output-model --output /new-output-model
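If you'd rather merge the annotations into one dataset first, you can go via the JSONL export. Here's a sketch (the dataset and file names are made up):
prodigy db-out toys_dataset > toys.jsonl
prodigy db-out dates_dataset > dates.jsonl
cat toys.jsonl dates.jsonl > combined.jsonl
prodigy db-in combined_dataset combined.jsonl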
Once you're getting more "serious" about training, you might also want to look into strategies for preventing the "catastrophic forgetting problem". You don't want your model to overfit on the new data and "forget" what it had previously learned. So one solution could be to always make sure to include examples of what the model previously "got right". Prodigy should hopefully make this a lot easier, because you can put together different datasets and run quick experiments to find out what works best on your data. (It's always difficult to give definitive advice here, because it always depends on your very specific use case and the data you're working with.)
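One rough sketch of that idea: run your original model over raw text with ner.make-gold, accept the predictions it already gets right for the built-in labels, and include the resulting dataset when you train your new types (the dataset and file names here are made up):
prodigy ner.make-gold revision_dataset en_core_web_sm raw_texts.jsonl --label PERSON,ORG,DATE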
Thanks! Also, does ner.make-gold allow using patterns for detecting entities too? E.g. instead of the standard spaCy entities like PERSON, ORG and DATE, I'd want PARTY, COMMENCEMENT_DATE etc., just like ner.teach allows patterns?
No, ner.make-gold uses the model's predictions to suggest the entities, so its main purpose is bootstrapping gold-standard data by correcting the already existing entities, not adding new types. To train your new entity types, you might want to chain different recipes together and try a workflow like this (there's a rough command sketch after the list):
1. Create match patterns for your new types like PARTY, COMMENCEMENT_DATE etc.
2. Use ner.teach with a model in the loop and patterns to bootstrap training examples. The active learning plus patterns can be really helpful here, because it lets you collect a larger training set more quickly.
3. Pre-train your model from the annotations.
4. Run ner.make-gold with your model, see how it performs and correct its predictions.
5. Train again and evaluate the results (also try ner.train-curve to see how your model is improving).
6. Identify the areas that are still problematic, and collect more specific examples.
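Put together as commands, that workflow could look roughly like this. It's only a sketch: the dataset names, patterns file and output paths are made up, and the exact arguments may differ between Prodigy versions:
prodigy ner.teach legal_dataset en_core_web_sm your_data.jsonl --patterns legal_patterns.jsonl --label PARTY,COMMENCEMENT_DATE
prodigy ner.batch-train legal_dataset en_core_web_sm --output /tmp/legal-model
prodigy ner.make-gold legal_gold /tmp/legal-model your_data.jsonl --label PARTY,COMMENCEMENT_DATE
prodigy ner.batch-train legal_gold /tmp/legal-model --output /tmp/legal-model-v2
prodigy ner.train-curve legal_gold /tmp/legal-model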
One recipe I forgot to mention in my comment is ner.match, which uses a patterns file to suggest entities that you can then accept or reject. The source of the recipes is also included with Prodigy, so once you're familiar with the built-in recipes, you could also experiment with building your own, taking inspiration from the existing ones. For example, you could try a recipe using the ner_manual interface that's populated with patterns. See this page for more details on custom recipes.
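For example, here's a quick sketch of ner.match, plus how you'd point Prodigy at a custom recipe file via the -F argument (the dataset, file and recipe names are made up):
prodigy ner.match match_dataset en_core_web_sm your_data.jsonl --patterns patterns.jsonl
prodigy my-custom-recipe my_dataset your_data.jsonl -F recipe.py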
Btw, based on your examples, I assume you're working with legal texts? If so, I'd also recommend checking out @wpm's posts on this forum. He's actually built some pretty sophisticated training pipelines with Prodigy for legal NER and has shared a lot of his findings and work in progress.
Yeah true, I am currently working with legal texts. Thanks for the advice, I'll check out @wpm's posts.