I know there are lots of variations of this question, so apologies in advance.
In my use case, I have 10 entity types I would like to train. Most of them are generic, in the sense that en_core_web_lg probably already predicts them, while some are pretty domain-specific, so I was thinking it might be better to initialize those from a blank model. 10 is a lot of entity types, but there doesn't appear to be that much variance within 6 or 7 of them (maybe as few as 50-100 distinct examples). For those entity types, patterns seem to capture most of the cases I care about.
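For context, a pattern file for one of the domain-specific entity types could look something like the sketch below, assuming Prodigy's standard match-pattern JSONL format (the label MATERIAL and the example terms here are made up):

```python
import json

# Hypothetical label and seed terms -- substitute your own domain vocabulary
patterns = [
    # single-token match from a seed list
    {"label": "MATERIAL", "pattern": [{"lower": "graphene"}]},
    # multi-token match
    {"label": "MATERIAL", "pattern": [{"lower": "carbon"}, {"lower": "nanotube"}]},
]

# Prodigy expects one JSON object per line (JSONL)
with open("material_patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

The resulting file can then be passed to ner.teach via the --patterns argument.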
What I was planning to do was:
- Create a seed list, pattern list, and dataset for each of the 10 entity types.
- Run ner.teach for each of the 10 entity types, initializing from en_core_web_lg for the generic ones and from a blank model for the domain-specific ones. I would stop when the model shows reasonable performance in the loop.
- Merge the 10 datasets and run ner.batch-train from a blank model with all 10 entity types, repeating ner.batch-train until I get good performance.
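In Prodigy v1.x command terms, the plan above might look roughly like this (the dataset and file names are made up, and the exact flags may differ between versions, so treat this as an illustrative sketch rather than copy-paste commands):

```shell
# Teach a generic label with a model in the loop, starting from en_core_web_lg
prodigy ner.teach org_db en_core_web_lg data.jsonl --label ORG

# Teach a domain-specific label, seeded with patterns
prodigy ner.teach material_db en_core_web_lg data.jsonl \
    --label MATERIAL --patterns material_patterns.jsonl

# Merge the per-label datasets into one
prodigy db-merge org_db,material_db merged_db

# Batch-train on the merged annotations, holding out a slice for evaluation
prodigy ner.batch-train merged_db en_core_web_lg \
    --output ./trained_model --eval-split 0.2
```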
I did try "stacking" the annotations: starting with the entity type with the least variation, then adding the one with the next least variation, and so on. But I didn't see that suggested as a workflow anywhere.
Q1: Is it okay to mix and match spaCy models like this, as long as the datasets are kept separate?
Q2: When does it make sense to train multiple entity types together vs. individually?
Q3: When you do binary annotations, is the idea that you want to supply a ton of negative examples?
Thank you so much! I'm really loving Prodigy so far.
Are you planning on training one single model from those annotations later on? If you're not annotating with a model in the loop, it might not matter very much. But if you're using a recipe like ner.teach, it might.
In a recipe like ner.teach, you're updating the model in the loop and collecting the best possible annotations to improve that particular model. The best possible annotations for model A might not be the best possible annotations for model B. So if you collect accept/reject annotations on predictions from en_core_web_lg and then update a blank model with those, the blank model might not actually learn anything meaningful from them.
Ideally, in your final model, you probably want to be able to recognise all entity types, so you'd be training on all datasets (like you're already doing). But during the development phase, you might want to experiment with only training a subset of entity types to see how the model learns them and to debug potential problems.
For example, you might find that your final model isn’t very good, so you train from only labels A, B and C, which turns out to produce really good results. Then you add label D, and everything goes downhill. Then you train on A, B and D and it’s great. This gives you super valuable insights into what could be happening here (conflicting annotations in C and D? fuzzy boundaries between those two types? etc.).
It’s definitely common to have more negative examples than positive ones – simply because the active learning-powered recipes will ask you to annotate the most uncertain predictions among all possible analyses. There’s only one correct analysis, and many wrong analyses. That said, you do want to make sure you have at least some positive examples in there so you can perform more meaningful updates. Patterns are a good way to help with that. (Btw, for an illustration of how the updates are performed for incomplete, binary annotations, see my slides here.)
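As a quick sanity check on that balance, you can count the answers in your collected annotations. The sketch below assumes Prodigy's JSONL export format (e.g. from db-out), where each task carries an "answer" field; the example rows are made up:

```python
from collections import Counter

# A few inline example tasks standing in for an exported annotation set
export = [
    {"text": "Apple is hiring", "answer": "accept"},
    {"text": "apple pie recipe", "answer": "reject"},
    {"text": "visit apple.com", "answer": "reject"},
]

# Tally accepts vs. rejects to see whether any positives made it in
counts = Counter(task["answer"] for task in export)
print(counts["accept"], counts["reject"])
```

If the accept count is close to zero, that's a hint to add more patterns so the model sees some positive examples to learn from.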
Thank you for the quick and informative response!