I'm building an NER engine and want to train en_core_web_sm on my own data. I have several fairly large data files to train on, which is where Prodigy's nice annotation functionality comes in. However, I'm a bit unsure what the best approach is, which probably stems from a limited understanding of how the annotation process works.
My approach so far has been to create a single dataset and do a succession of ner.teach runs, one for each of my training data files and each entity type I'm interested in. At the very end, when all the annotations are complete, I run ner.batch-train on that single dataset (containing the annotations from all the separate ner.teach runs), followed by spacy package with --create-meta and all that, to produce a loadable spaCy model.

However, while annotating I notice that Prodigy often confronts me with examples I have already dealt with many times in previous files, even though the annotations are all being saved to the same dataset. Say I run ner.teach on my first training data file and annotate ten times that 'Disneyland' is not a person; Prodigy will still ask me whether 'Disneyland' is a person when I run ner.teach on the next training data file. It's a bit unclear to me how the annotation process works, but my impression is that the annotations are entered into the dataset independently of each other, i.e. there is no active-learning element in the annotation/ner.teach process itself, and the model isn't actually updated with the annotations until ner.batch-train is called. Is this right so far?
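To make that concrete, here's roughly what my current workflow looks like (dataset, file, and label names are just placeholders, and the exact flags may differ between versions):

```bash
# One dataset collects the annotations from every run
prodigy dataset ner_annotations "Annotations from all training files"

# Repeated for each training data file and each label I'm interested in
prodigy ner.teach ner_annotations en_core_web_sm ./file1.jsonl --label PERSON
prodigy ner.teach ner_annotations en_core_web_sm ./file2.jsonl --label PERSON
# ... and likewise for the other files and entity types

# Once all annotation is done, train once on the combined dataset
prodigy ner.batch-train ner_annotations en_core_web_sm --output ./my-model

# Package the trained model as a loadable spaCy model
python -m spacy package ./my-model ./packages --create-meta
```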
Assuming that it is, I suspect it would be more useful to run ner.batch-train and package a loadable spaCy model each time I've worked through a training data file, and then use that model, rather than en_core_web_sm, as the basis for the next runs of ner.teach and ner.batch-train (something like the sketch at the end of this post). My presumption is that the annotations from the first training data file will have been learned by the model I've just created, so when I use it as my basis Prodigy won't ask me again whether 'Disneyland' is a person, but will instead focus on more interesting questions that build on that knowledge, so to speak. That would obviously mean a fair bit of extra bookkeeping and time (ner.batch-train takes a while to run), since I would then have to save each ner.teach run in a separate dataset, for example (right?). But if the results warrant it, that would be a minor issue.

Have I understood this correctly? I'm a bit worried that I've missed some fundamental element of how Prodigy works, so I'd like to get clarification before I do a lot of potentially superfluous work.
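Here's the sketch of that alternative loop (dataset and path names are placeholders, and I'm assuming the recipes will accept a model directory path in place of a package name as the base model):

```bash
# Round 1: annotate file 1 against the stock model, then train on those annotations
prodigy dataset round1 "Annotations for file 1"
prodigy ner.teach round1 en_core_web_sm ./file1.jsonl --label PERSON
prodigy ner.batch-train round1 en_core_web_sm --output ./model-round1

# Round 2: annotate file 2 against the round-1 model, then train on top of it
prodigy dataset round2 "Annotations for file 2"
prodigy ner.teach round2 ./model-round1 ./file2.jsonl --label PERSON
prodigy ner.batch-train round2 ./model-round1 --output ./model-round2

# ... and so on for each remaining training data file
```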