I really love your work, the documentation, and the friendly team you've got. I'm very excited to dig deep into spaCy and NER in general. But before I buy Prodigy and begin creating datasets and training a model, I've got some questions.
Let's say I want to train a spaCy model for finding PII, but also for other entities, like "medications/drugs" and "psychological medical diagnosis". So these are very different. Looking at the "Training spaCy NER Models with Prodigy" flowchart, I came to the conclusion that I probably should:
Create and train a new model from scratch for the medications/diagnosis task
Fine-tune a model for the PII task (since I'm not adding more than 3 new entity types)
I would actually love to have both in one single model, since they will both process the same text, but finding PII should not suffer because of the other entities it's trained on. Do you think it's possible to have both in the same model? If so, should I train a model from scratch for both?
And last, should the data in the training set contain full sentences? One sentence or multiple sentences?
Thank you very much for guiding me in this regard - looking forward to your answers.
In general, adding new entities to a pre-trained model tends to be tricky, as you are effectively updating the existing weights based on a new signal that might not be as representative as the original training data. This can have some undesirable effects, notably catastrophic forgetting (which has solutions, of course, but is best avoided if possible).
In practice, adding new categories can work if the new categories are very different from the existing ones, but in most cases we would recommend training from scratch.
You could create a small manually annotated dataset (or datasets) for your desired categories (PII, medications/drugs, and psychological diagnoses), train an initial model and iterate, ideally speeding up further annotation significantly by putting the model in the loop with Prodigy's ner.teach or ner.correct workflows.
Starting from scratch allows you to tailor the model specifically to your domain and entity types, which can lead to better overall performance. While fine-tuning an existing model can work well for PII, starting fresh ensures that all your entity types are treated equally in the training process. Rather than fine-tuning it, you can use your existing PII model to pre-annotate your data with the PII category, speeding up the manual annotation of this category (see the summary below for concrete steps)!
Regarding your question about the training data:
Your training data should indeed contain full sentences or even multiple sentences. Context is crucial for NER tasks.
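To make this a bit more concrete, here is a minimal sketch of what one such example could look like as a JSONL record with a "text" and character-offset "spans". The sentence, the label names (PERSON, DIAGNOSIS, MEDICATION) and the helper make_span are all made up for illustration; your own label scheme will differ.

```python
import json


def make_span(text, phrase, label):
    """Hypothetical helper: character-offset span for the first occurrence of `phrase`."""
    start = text.index(phrase)  # raises ValueError if the phrase is not in the text
    return {"start": start, "end": start + len(phrase), "label": label}


# Made-up example: two sentences kept together so the surrounding context is preserved.
text = (
    "Mr. Jansen was diagnosed with generalized anxiety disorder last year. "
    "He has been taking sertraline since March."
)

example = {
    "text": text,
    "spans": [
        make_span(text, "Jansen", "PERSON"),
        make_span(text, "generalized anxiety disorder", "DIAGNOSIS"),
        make_span(text, "sertraline", "MEDICATION"),
    ],
}

# JSONL means one JSON object per line.
line = json.dumps(example)
```

Keeping both sentences in one example means "He" and "taking" stay next to the medication mention, which is the kind of context the model can learn from.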
In summary:
To avoid some undesired effects of adding new categories to a pre-trained model, it's usually best to train from scratch.
You could still use your existing PII model to speed up the annotation of the PII category. You could create one NER dataset with the existing PII model in the loop and then separate NER dataset(s) for the remaining categories.
You can then pass all these separate NER datasets to Prodigy's train recipe, which will take care of merging all the annotations on the same examples.
If your PII model can be implemented as a spaCy pipeline, you can use it directly with the model-as-annotator or Prodigy active learning recipes. Please check the "I want to plug in and correct a non-spaCy model." section of the NER quickstart on how to use non-spaCy models for pre-annotation.
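To illustrate what merging annotations on the same examples means, here is a rough conceptual sketch. It is not Prodigy's implementation (Prodigy handles this for you during training); it simply assumes that examples with the exact same text belong together, and the function merge_by_text and the sample data are hypothetical.

```python
def merge_by_text(*datasets):
    """Combine span annotations from several datasets, keyed on the exact text."""
    merged = {}  # text -> {(start, end, label): span}
    for dataset in datasets:
        for example in dataset:
            spans = merged.setdefault(example["text"], {})
            for span in example.get("spans", []):
                # Identical spans from different datasets are stored only once.
                spans[(span["start"], span["end"], span["label"])] = dict(span)
    return [
        {"text": text, "spans": [spans[key] for key in sorted(spans)]}
        for text, spans in merged.items()
    ]


# Made-up data: the same sentence annotated once for PII and once for medications.
text = "Anna Meyer takes sertraline."
pii_data = [{"text": text, "spans": [{"start": 0, "end": 10, "label": "PERSON"}]}]
med_data = [{"text": text, "spans": [{"start": 17, "end": 27, "label": "MEDICATION"}]}]

merged = merge_by_text(pii_data, med_data)
```

After merging, the single sentence carries both the PERSON and the MEDICATION span, so one model can learn all categories from it.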
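As a rough sketch of the non-spaCy route: a small script can run your detector over raw texts and write out records with "text" and "spans", which can then be loaded for manual correction. The regex below is only a stand-in for a real PII model, and detect_pii and to_task are hypothetical names used for illustration.

```python
import json
import re

# Stand-in for a non-spaCy PII model: a simple e-mail regex (assumption for illustration).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect_pii(text):
    """Hypothetical detector returning (start, end, label) tuples."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL_RE.finditer(text)]


def to_task(text):
    """Wrap the detector output in a record with "text" and "spans" keys."""
    spans = [
        {"start": start, "end": end, "label": label}
        for start, end, label in detect_pii(text)
    ]
    return {"text": text, "spans": spans}


# Made-up input texts.
texts = [
    "Please send the report to j.doe@example.com before Friday.",
    "The patient reported trouble sleeping.",
]
tasks = [to_task(t) for t in texts]

# One JSON object per line, ready to be written to a .jsonl file.
jsonl = "\n".join(json.dumps(task) for task in tasks)
```

Texts where the detector finds nothing still get an empty "spans" list, so they stay in the stream and can be annotated by hand.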
Finally, depending on the specifics of your categories you might consider modelling some of them with the span categorizer.