questions on Multi NERs Annotation & Training at Once in a Sentence

hi @ruiye!

Thanks for your message!

This is excellent. We're happy to help coach you along the way. I'll provide a lot of detailed links that can point you in the right direction. Some you may already know, but hopefully they'll also give other community members context. Also, keep searching the wealth of information in the spaCy documentation, the Prodigy documentation, this forum, and the spaCy GitHub community.

Can I rephrase this to align with Prodigy's terminology?

I would suggest that your goal is to train one NER model with four entity types (RIGHTV, RIGHTN, ACCESSV, ACCESSN), not "4 NERs". I would recommend looking over Prodigy's glossary of terms.

The ner.manual recipe produces manual annotations that highlight the full entity span. By contrast, binary recipes like ner.teach create annotations as yes/no decisions. Recipes like ner.teach or ner.correct are usually for when you already have a model that you want to improve incrementally. This is annotating with a model in the loop: the model affects the order of annotation, or at least predicts the entities so the annotator only has to correct them.
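For instance, annotating all four labels in one pass with ner.manual could look roughly like this (the dataset name and source file are placeholders for your own):

```shell
# Highlight full entity spans for all four labels at once.
# "my_ner_data" and news.jsonl are hypothetical names.
prodigy ner.manual my_ner_data blank:en ./news.jsonl \
  --label RIGHTV,RIGHTN,ACCESSV,ACCESSN
```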

prodigy train is simply a wrapper for spacy train, and spacy train is defined by its config file. To make things easier, prodigy train will create a default config file for you, identical to the one produced by spacy init config.
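As a rough sketch of that relationship (the dataset name and output paths are placeholders):

```shell
# What prodigy train generates under the hood is equivalent to:
spacy init config config.cfg --lang en --pipeline ner

# You can also hand prodigy train your own config explicitly:
prodigy train ./output --ner my_ner_data --config config.cfg
```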

I see you used your own config.cfg file. This is excellent. If you're interested in how training is done, see spaCy training docs or look over the config docs.

Also, you may find the NER Prodigy docs page's section on training strategies helpful. Here's an excerpt:

Should I start with a blank model or update an existing one?

spaCy’s NER architecture was designed to support continuous updates with more examples and even adding new labels to existing trained models. Updating an existing model makes sense if you want to keep using the same label scheme and only want to fine-tune it on more data or more specific data. However, it can easily lead to inconsistent behavior if you’re adding new entity types and/or annotations that conflict with the data the model was trained on. For instance, if you suddenly want to predict all cities as CITY instead of GPE. Instead of trying to “fight” the existing weights trained on millions of words, it often makes more sense to train a new model from scratch.

Even if you’re training from scratch, you can still use a trained model to help you create training data more efficiently. Prodigy’s ner.correct will stream in the model’s predictions for the given labels and lets you manually correct the entity spans. This way, you can let a model label the entity types you want to keep, add your new types on top, and make corrections along the way. This is a very effective method for bootstrapping large gold-standard training corpora without having to do all the labelling from scratch.
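A sketch of that bootstrapping workflow (model, dataset, and file names are placeholders; the pretrained model predicts the labels you keep, and you add your new labels manually on top):

```shell
# Stream in an existing model's predictions and correct the spans by hand.
# en_core_web_sm will suggest entities it knows; RIGHTV/RIGHTN are added manually.
prodigy ner.correct my_corrections en_core_web_sm ./news.jsonl \
  --label GPE,RIGHTV,RIGHTN
```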

Here's another discussion on the differences.

Note this discussion was from 2019, back when training for NER was done with the ner.batch-train recipe rather than prodigy train.

Also, this post below details some differences between Prodigy and spaCy. Just note that spaCy 3.0 has come out since then and changed a lot. If you want a strong, reproducible project, I would encourage learning spaCy projects. You can find a great template that integrates with Prodigy in the spaCy projects repo.

Ideally, better. See this post:

See this post:

Very important: if you're experimenting, make sure to create a dedicated hold-out (evaluation) dataset early on. It's easy to rely on --eval-split, but that means your evaluation set is re-sampled on every run. Without a very large dataset, your model's evaluation scores can then swing wildly simply because the evaluation examples changed, which will confuse your results. Once you have a dedicated evaluation dataset, you can pass it to prodigy train with the eval: prefix like:

prodigy train --ner train_data,eval:eval_data ...

Also, this post explains more:

See this post:

But you may want to check out the NER workflow (fyi we're planning to update this very soon with improved names!):

https://prodi.gy/36f76cffd9cb4ef653a21ee78659d366/prodigy_flowchart_ner.pdf

No, annotating separately isn't needed. However, here's a little background if you want to exclude some examples from annotation.

Be sure to use the --exclude argument, which takes the names of datasets whose examples should be excluded from the stream. By default, Prodigy excludes by task_hash: a unique code (hash), computed automatically, that identifies every record by its input text plus its annotation task (e.g., ner.manual). You can switch to excluding by input_hash instead by setting exclude_by in your prodigy.json (config file). There are 50+ Prodigy Support threads that tackle exclude problems; see them for examples of workflows and answers to other questions.
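For example, skipping anything already annotated in an earlier dataset might look like this (dataset and file names are placeholders):

```shell
# Exclude examples already annotated in the "seen_examples" dataset.
# "my_ner_data", "seen_examples", and news.jsonl are hypothetical names.
prodigy ner.manual my_ner_data blank:en ./news.jsonl \
  --label RIGHTV,RIGHTN,ACCESSV,ACCESSN --exclude seen_examples
```

To exclude by input text regardless of the task type, you'd set `"exclude_by": "input"` in your prodigy.json.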

Last, an incredibly powerful design philosophy behind Prodigy is that no one knows in advance the right way to build your model. Instead, Prodigy is designed to let you rapidly experiment and iterate on your unique problem. The Named Entity Recognition docs page has a great section on how to choose the right recipe and workflow.