Hi there,
I am training new entities in the field of R&D (Health, Energy, Technology...). I am training the new entities one by one, each in a separate model, and I am following the Annotation Flowchart: Named Entity Recognition,
which I have found really useful. Thanks a lot! But I still have questions because I am pretty new to Prodigy:
- As my new entities will not overlap with the existing ones, I am training a new model from scratch following these steps:
import spacy
nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('tagger'))
nlp.add_pipe(nlp.create_pipe('parser'))
nlp.begin_training()
nlp.to_disk('blank_model')
If I use ner.teach directly, I receive the error "No component 'ner' found in pipeline. Available names: ['sentencizer', 'tagger', 'parser']".
If I use ner.batch-train and then ner.teach, I don't receive the error above, but the model doesn't recognise the POS in my patterns file.
How can I create a new model from scratch for use in the ner.teach recipe? Would it be better if I use en_core_web_lg instead?
- In my text I have short phrases that I want to capture with my entity label, but I think my pattern may be too vague, so the model will be confused. For example, my entities for energy could look like this:
And many more and many different...
So I would rather use a pattern with many optional tokens in order to capture all these possibilities. Something that could look like this:
{'label': 'ENERGY',
'pattern': [{'POS': 'ADV', 'OP': '?'},
{'POS': 'NOUN', 'OP': '?'},
{'POS': 'ADJ', 'OP': '?'},
{'ORTH': '-', 'OP': '?'},
{'POS': 'ADJ', 'OP': '?'},
{'POS': 'CCONJ', 'OP': '?'},
{'POS': 'NOUN', 'OP': '?'},
{'POS': 'ADP', 'OP': '?'},
{'POS': 'ADJ', 'OP': '?'},
{'LOWER': 'bioenergy'},
{'ORTH': '-', 'OP': '?'},
{'POS': 'CCONJ', 'OP': '?'},
{'POS': 'ADJ', 'OP': '?'},
{'POS': 'NOUN', 'OP': '?'},
{'POS': 'ADP', 'OP': '?'},
{'POS': 'NOUN', 'OP': '?'},
{'ORTH': '(', 'OP': '?'},
{'POS': 'PROPN', 'OP': '?'},
{'POS': 'NOUN', 'OP': '?'},
{'ORTH': ')', 'OP': '?'}]},
And this for every energy term that I have on my list. I know it doesn't look like a very clever pattern, but I don't know how to capture all those possibilities... And if I get all those short phrases labelled as ENERGY, will the model be able to learn from those annotations?
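To make "this for every energy term" concrete, here is roughly how I would generate the patterns file. It is only a sketch under a few assumptions: the optional tokens are a shortened version of the ones in my pattern above, "solar power" is a made-up placeholder term (only "bioenergy" is really from my list), and build_pattern and patterns_to_jsonl are my own hypothetical helper names, not part of Prodigy or spaCy. I am also assuming the patterns file should be JSONL, with one JSON object per line.

```python
import json

# Shortened version of the optional context tokens from my pattern above.
OPTIONAL_BEFORE = [
    {"POS": "ADV", "OP": "?"},
    {"POS": "ADJ", "OP": "?"},
    {"POS": "NOUN", "OP": "?"},
]
OPTIONAL_AFTER = [
    {"POS": "ADP", "OP": "?"},
    {"POS": "NOUN", "OP": "?"},
]


def build_pattern(term, label="ENERGY"):
    """Wrap one term from my list in the optional context tokens."""
    # Assumption: splitting on whitespace gives the same tokens as the tokenizer.
    core = [{"LOWER": word} for word in term.lower().split()]
    return {"label": label, "pattern": OPTIONAL_BEFORE + core + OPTIONAL_AFTER}


def patterns_to_jsonl(terms, label="ENERGY"):
    """Return one JSON object per line, one line per term."""
    return "\n".join(json.dumps(build_pattern(t, label)) for t in terms)


if __name__ == "__main__":
    # "bioenergy" is from my list; "solar power" is a made-up placeholder.
    print(patterns_to_jsonl(["bioenergy", "solar power"]))
```

I would then save the output as a .jsonl file and pass it to ner.teach with the --patterns option. The only required token in each pattern is the term itself, so every match is anchored on something from my list, and everything around it is optional.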
Could you please give me some advice with my questions?
And are you going to provide something similar to the NER annotation flowchart for textcat in the near future? It is very useful for beginners like me!
Thanks a lot!