I am fairly new to the NLP world, and I am doing a university project on transfer learning with named entity recognition. I've been doing some research on this and trying to understand how Prodigy and the spaCy pretrain command could help me with my task.
I want to add a couple of new entities to the es_core_news_md model (in Spanish), and I want to train it with a dataset I've been given about digital marketing (about 300 .pdf files with raw text on that topic), so that the improved model can recognize these new entities.
My idea at the moment is to do something similar to what Matthew Honnibal did in the "TRAINING A NEW ENTITY TYPE with Prodigy" video, first building a terminology list using spaCy's word vectors. So my first question is: is terms.to-patterns some kind of "pretraining" of the model?
Assuming I do that and then I start teaching my new entities in context to the model by getting some predictions thanks to the patterns file, here comes my final question.
Would it be right to say that I am doing transfer learning because I used the source domain, es_core_news_md with word vectors, to create rules and then transferred that knowledge to my target model, which could, hopefully, recognize some new entities in the digital marketing subject?
I hope I was clear enough. Thank you!
Glad we can help you get started with NLP, and I hope your project will be successful.
While the approach you've described makes sense, I think there's also a lot of value in keeping things simple, especially when you're first starting out. I would therefore advise you to start off by annotating some data with the ner.manual recipe. This will let you get a feel for your task, and make sure you're able to annotate it consistently. It will also help you understand how much a terminology list can help. You might find that the terms are too ambiguous for your task, or you might find that there's a small number of terms that are very useful.
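To make that first step concrete, here is a minimal sketch of turning raw extracted text into a JSONL file with one {"text": ...} record per line, which is the kind of input a manual annotation recipe can read. The helper name texts_to_jsonl, the min_chars threshold, and the blank-line paragraph split are all assumptions for illustration, and the sample document is made up. Extracting the text from the .pdf files is left out.

```python
import json


def texts_to_jsonl(texts, min_chars=20):
    """Split raw documents into paragraph-sized tasks, one JSON string per task."""
    lines = []
    for text in texts:
        # Assumption: blank lines mark paragraph boundaries in the extracted text.
        for paragraph in text.split("\n\n"):
            # Collapse line breaks and repeated spaces left over from extraction.
            paragraph = " ".join(paragraph.split())
            # Skip very short fragments such as stray page numbers.
            if len(paragraph) >= min_chars:
                lines.append(json.dumps({"text": paragraph}, ensure_ascii=False))
    return lines


# Hypothetical sample document standing in for one extracted .pdf file.
docs = [
    "El SEO mejora el posicionamiento orgánico de una web.\n\n12\n\n"
    "Una campaña de email marketing mide la tasa de apertura."
]

for line in texts_to_jsonl(docs):
    print(line)
```

Writing the returned lines to a .jsonl file gives you short examples to annotate, which usually makes it easier to stay consistent than annotating whole documents at once.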
If you find the annotation is quick with ner.manual, you can also go on to train an initial model with the manual annotations. I do think creating a terminology list with terms.to-patterns, or some other process to produce initial patterns, is likely to be useful, but it's best to start out a bit more directly, so that you know you're making progress on your main goal.
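As an example of "some other process", here is a rough sketch of building token-based patterns from a plain list of terms. Each entry has a label and a list of token descriptions matched on the lowercase form, which is the general shape of a match patterns file. This is not how terms.to-patterns itself is implemented: the function name, the MARKETING_TERM label, and the example terms are hypothetical, and splitting on whitespace is a simplification that may not agree with the model's tokenizer for terms containing punctuation.

```python
import json


def terms_to_patterns(terms, label):
    """Build one token-based pattern per term, matching on lowercase token text."""
    patterns = []
    for term in terms:
        # Simplification: whitespace tokens, not the model's own tokenization.
        tokens = term.split()
        if not tokens:
            continue  # ignore empty entries in the list
        patterns.append(
            {"label": label, "pattern": [{"lower": token.lower()} for token in tokens]}
        )
    return patterns


# Hypothetical terminology list and label for the digital marketing domain.
terms = ["SEO", "email marketing", "tasa de conversión"]
patterns = terms_to_patterns(terms, "MARKETING_TERM")

for pattern in patterns:
    print(json.dumps(pattern, ensure_ascii=False))
```

Note that the patterns only suggest candidate spans for you to accept or reject. Nothing is learned by the model until you train it on the resulting annotations.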
Ok, I'll try that. Thank you very much!