Hi there,
I am training new entities in the field of R&D (Health, Energy, Technology...). I am training the new entities one by one, each in a separate model, and I am following the Annotation Flowchart: Named Entity Recognition, which I have found really useful. Thanks a lot! But I still have questions because I am pretty new to Prodigy:
-
As my new entities will not overlap with the existing ones, I am training a new model from scratch following these steps:

import spacy

nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('tagger'))
nlp.add_pipe(nlp.create_pipe('parser'))
nlp.begin_training()
nlp.to_disk('blank_model')

If I use ner.teach directly, I receive the error "No component 'ner' found in pipeline. Available names: ['sentencizer', 'tagger', 'parser']".
If I use ner.batch-train and then ner.teach, I don't receive the error above, but the model doesn't recognise the POS in my patterns file.
How can I create a new model from scratch to use with the ner.teach recipe? Would it be better if I use en_core_web_lg instead?
-
In my text I have short phrases that I want to capture with my entity label, but I am worried that my pattern may be too vague and will confuse the model. For example, my entities for energy could look like this:
And many more, in many different forms...
So I would rather use one pattern with many optional tokens in order to capture all these possibilities, something like this:
{'label': 'ENERGY',
 'pattern': [{'POS': 'ADV', 'OP': '?'},
             {'POS': 'NOUN', 'OP': '?'},
             {'POS': 'ADJ', 'OP': '?'},
             {'ORTH': '-', 'OP': '?'},
             {'POS': 'ADJ', 'OP': '?'},
             {'POS': 'CCONJ', 'OP': '?'},
             {'POS': 'NOUN', 'OP': '?'},
             {'POS': 'ADP', 'OP': '?'},
             {'POS': 'ADJ', 'OP': '?'},
             {'LOWER': 'bioenergy'},
             {'ORTH': '-', 'OP': '?'},
             {'POS': 'CCONJ', 'OP': '?'},
             {'POS': 'ADJ', 'OP': '?'},
             {'POS': 'NOUN', 'OP': '?'},
             {'POS': 'ADP', 'OP': '?'},
             {'POS': 'NOUN', 'OP': '?'},
             {'ORTH': '(', 'OP': '?'},
             {'POS': 'PROPN', 'OP': '?'},
             {'POS': 'NOUN', 'OP': '?'},
             {'ORTH': ')', 'OP': '?'}]},
And this for every energy term that I have on my list. I know it doesn't look like a very clever pattern, but I don't know how else to capture all those possibilities. And if I do get all those short phrases labelled as ENERGY, will the model be able to learn from those annotations?
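To show what I mean by repeating the pattern above for every term, here is a rough sketch of how I imagine generating the patterns. It is only an illustration: 'bioenergy' is the one term taken from my example, the other terms are made up, and the names PREFIX, SUFFIX, optional, build_pattern and to_jsonl are placeholders I invented, not anything from Prodigy or spaCy.

```python
import json

# Optional tokens allowed before and after the core term, copied from the
# pattern above. Each (key, value) pair becomes {key: value, 'OP': '?'}.
PREFIX = [("POS", "ADV"), ("POS", "NOUN"), ("POS", "ADJ"), ("ORTH", "-"),
          ("POS", "ADJ"), ("POS", "CCONJ"), ("POS", "NOUN"), ("POS", "ADP"),
          ("POS", "ADJ")]
SUFFIX = [("ORTH", "-"), ("POS", "CCONJ"), ("POS", "ADJ"), ("POS", "NOUN"),
          ("POS", "ADP"), ("POS", "NOUN"), ("ORTH", "("), ("POS", "PROPN"),
          ("POS", "NOUN"), ("ORTH", ")")]


def optional(tokens):
    """Turn (key, value) pairs into optional token descriptions."""
    return [{key: value, "OP": "?"} for key, value in tokens]


def build_pattern(term, label="ENERGY"):
    """Build one pattern entry: optional prefix, required term, optional suffix."""
    core = [{"LOWER": term.lower()}]
    return {"label": label, "pattern": optional(PREFIX) + core + optional(SUFFIX)}


def to_jsonl(terms, label="ENERGY"):
    """Serialise one pattern per line (JSONL)."""
    return "\n".join(json.dumps(build_pattern(term, label)) for term in terms)


# Hypothetical term list: only 'bioenergy' comes from the example above.
TERMS = ["bioenergy", "biofuel", "photovoltaics"]
patterns_jsonl = to_jsonl(TERMS)
print(patterns_jsonl.split("\n")[0])
```

Writing patterns_jsonl to a file (for example energy_patterns.jsonl, a name I made up) would give one entry per line, each with a "label" and a "pattern" key. This only handles single-token terms, since each term ends up in one {'LOWER': ...} token.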
Could you please give me some advice on my questions?
And are you going to provide something similar to the NER annotation flowchart for textcat in the near future? It is very useful for beginners like me!
Thanks a lot!


