Model from the Prodigy batch-train command vs. custom training code, and how to use it

I’m sorry if the title of this post is confusing. Let me break it down step by step.

  • What I did

–> As you can see in the official spaCy GitHub repository, there are examples of how to train a model starting from a given base model (‘en’, ‘en_core_web_sm / lg’, etc.). I modified one of the tagger training examples to fit the model to my data, based on the ‘en’ model, and it succeeded!! I could use that trained model to get the right answers for my data. But then I discovered a limitation of that trained model: I wanted to get not only the POS (part-of-speech) features of a sentence but also the dep (dependency) features, and the trained tagger model can’t provide those. So I suddenly wondered: do I have to train the model again to get a dependency parser within the model? Maybe I can add that pipeline component while the model is training.
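On that last point: in spaCy you can assemble a pipeline with both a tagger and a parser before training, so both components get updated on your data. A minimal sketch, assuming the spaCy v3-style API (in older spaCy v2 you would use `nlp.create_pipe(...)` plus `nlp.add_pipe(...)` instead):

```python
import spacy

# Start from a blank English pipeline (assumption: spaCy v3-style API).
nlp = spacy.blank("en")

# Add both components before training begins; during training,
# nlp.update(...) would then update the tagger and the parser together.
nlp.add_pipe("tagger")
nlp.add_pipe("parser")

print(nlp.pipe_names)  # both components are now in the pipeline
```

This only shows pipeline assembly, not a full training loop; the training examples in the spaCy repository show the rest.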

  • Batch-training
    –> Does this built-in Prodigy function return a model that has a dependency parser, tagger, and NER? If so, what is the difference between training the model with the example code in the spaCy GitHub repository and using the batch-train function?


I’m not 100% sure this is the answer to your question, but: The pos/, dep/ and ner/ model subdirectories can be combined freely. You can train a model and then copy in a previous dep/ directory, and that should work fine.
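For example, combining directories from two model folders could look like this (`new_model/` and `old_model/` are placeholder names for illustration):

```shell
# Placeholder layout: new_model/ was just trained (tagger only),
# old_model/ is an earlier model that contains a dep/ directory.
mkdir -p new_model/pos old_model/dep

# Copy the earlier dependency-parser directory into the new model.
cp -r old_model/dep new_model/dep

ls new_model  # now contains both pos/ and dep/
```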

The Prodigy batch-train command does have some differences from spaCy’s training, as Prodigy supports learning from incomplete annotations. For instance, you might only know that some entity is incorrect. spaCy’s training assumes the entities are complete.
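To illustrate what the two formats can express, here is a rough sketch of the data shapes; the field names follow Prodigy's JSON task format, but the texts and labels are made up:

```python
# spaCy-style "complete" training example: every entity in the text
# is assumed to be listed, so anything unlabeled is treated as non-entity.
spacy_example = (
    "Apple opened a store in Paris",
    {"entities": [(0, 5, "ORG"), (24, 29, "GPE")]},
)

# Prodigy-style binary annotation: one candidate span plus an
# accept/reject answer. A "reject" only says this particular span/label
# is wrong; it says nothing about what the correct entities are.
prodigy_example = {
    "text": "Apple opened a store in Paris",
    "spans": [{"start": 0, "end": 5, "label": "PERSON"}],
    "answer": "reject",
}
```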

Does that help?