Difference between training on pre-annotated data using spaCy and Prodigy

I annotated data with the manual recipe, then ran the following commands:

prodigy ner.batch-train dummy_data en_core_web_sm --output /home/user --no- 

prodigy ner.teach new_dummy_data /home/user --label TESTONE,TESTTWO
  1. In the train command, why do we still use the en_core_web_sm spaCy model? When we train on our own annotations, shouldn’t the argument be the name of the model that I saved?

  2. What is the difference between training in spaCy and training in Prodigy? Will the results be the same if we use spaCy's training commands, or does using ner.batch-train actually make a difference?

  3. After training, in the ner.teach command, shouldn’t we pass the dataset (the annotations from the manual process) or the model that we just trained?

When you train a model, you usually need something to start with, even if it's just a blank language class like English that has the English tokenization rules, vocab etc. Sometimes you also want to start with word vectors to improve the accuracy. And you also often want to update an existing model further. So the model argument (en_core_web_sm in your ner.batch-train command) lets you pass in the base model to start with: either an existing model, or a blank model, like spacy.blank("en").to_disk("/model/path").
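To make the blank-model option concrete, here's a minimal sketch of creating a blank English model and saving it to disk so the directory can be used as a base model. The temporary directory and the name blank_en_model are just placeholders for this example, not paths from your setup.

```python
import tempfile
from pathlib import Path

import spacy

# Blank English pipeline: tokenization rules and vocab, no trained components.
nlp = spacy.blank("en")

# Hypothetical output location; in practice pick a directory of your own.
model_dir = Path(tempfile.mkdtemp()) / "blank_en_model"
nlp.to_disk(model_dir)

# The saved directory can be loaded like any other model.
reloaded = spacy.load(model_dir)
print(reloaded.lang, reloaded.pipe_names)
```

The path to that directory could then be passed as the model argument in place of en_core_web_sm if you'd rather start from scratch than from a pretrained model.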

Under the hood, both approaches end up calling nlp.update in some way, so they're doing the same thing. Prodigy's training commands are really just custom spaCy training loops. They're slightly more optimised for quick experiments and make it easy to train from incomplete and/or binary annotations (e.g. the ones you collect in ner.teach). If you train with spacy train, on the other hand, that's more optimised for training from large, gold-standard corpora.
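For illustration, here's roughly what a bare-bones training loop around nlp.update can look like. This is only a sketch and not Prodigy's actual implementation: it assumes the spaCy v3 API (Example objects and nlp.initialize), whereas older spaCy versions pass texts and annotations to nlp.update directly. The training sentence is made up, and the label is borrowed from your ner.teach command.

```python
import random

import spacy
from spacy.training import Example

# Made-up training data: (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    ("Send the report to TESTONE team", {"entities": [(19, 26, "TESTONE")]}),
]

# Start from a blank English pipeline and add an entity recognizer.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("TESTONE")

# Pair each tokenized text with its gold-standard annotations.
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]

# Initialize the model weights and get an optimizer.
optimizer = nlp.initialize(lambda: examples)

for epoch in range(5):
    random.shuffle(examples)
    losses = {}
    # The core call that any spaCy training loop is built around.
    nlp.update(examples, sgd=optimizer, losses=losses)
    print(epoch, losses)
```

A loop like this treats every example as complete, gold-standard annotation. Handling binary accept/reject answers is the part that Prodigy's recipes add on top.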

Yes, exactly, assuming you want to improve the model you just trained in the loop. The model is the second argument on the command line, which in your example is /home/user (in a real-life scenario, you'd probably want to choose a more specific subdirectory here).