Following the NER annotation flowchart: questions on a new model and patterns file

Hi there,

I am training new entities in the field of R&D (Health, Energy, Technology...). I am training the new entities one by one, each in a separate model, and I am following the Annotation Flowchart: Named Entity Recognition, which I have found really useful. Thanks a lot! But I still have some questions, because I am pretty new to Prodigy:

  1. As my new entities will not overlap with the existing ones, I am training a new model from scratch following these steps:

    import spacy

    nlp = spacy.blank('en')
    nlp.add_pipe(nlp.create_pipe('tagger'))
    nlp.add_pipe(nlp.create_pipe('parser'))
    nlp.begin_training()
    nlp.to_disk('blank_model')

    If I use ner.teach directly, I receive the error: "No component 'ner' found in pipeline. Available names: ['sentencizer', 'tagger', 'parser']"

    If I run ner.batch-train first and then ner.teach, I don't receive the error above, but the model doesn't recognise the POS tags in my patterns file.

    How can I create a new model from scratch for use with the ner.teach recipe? Would it be better to use en_core_web_lg instead?

  2. In my text I have short phrases that I want to capture with my entity label, but I think I may be giving too vague a pattern, so the model will get confused. For example, my entities for energy could look like this:

And many more, in many different forms...

So I would rather use a pattern with many optional tokens in order to capture all these possibilities. Something that could look like this:

{'label': 'ENERGY',
  'pattern': [{'POS': 'ADV', 'OP': '?'},
   {'POS': 'NOUN', 'OP': '?'},
   {'POS': 'ADJ', 'OP': '?'},
   {'ORTH': '-', 'OP': '?'},
   {'POS': 'ADJ', 'OP': '?'},
   {'POS': 'CCONJ', 'OP': '?'},
   {'POS': 'NOUN', 'OP': '?'},
   {'POS': 'ADP', 'OP': '?'},
   {'POS': 'ADJ', 'OP': '?'},
   {'LOWER': 'bioenergy'},
   {'ORTH': '-', 'OP': '?'},
   {'POS': 'CCONJ', 'OP': '?'},
   {'POS': 'ADJ', 'OP': '?'},
   {'POS': 'NOUN', 'OP': '?'},
   {'POS': 'ADP', 'OP': '?'},
   {'POS': 'NOUN', 'OP': '?'},
   {'ORTH': '(', 'OP': '?'},
   {'POS': 'PROPN', 'OP': '?'},
   {'POS': 'NOUN', 'OP': '?'},
   {'ORTH': ')', 'OP': '?'}]},

And this for every energy term that I have on my list. I know it doesn't look like a very clever pattern :blush: but I don't know how else to capture all those possibilities... And if I do get all those short phrases labelled as ENERGY, will the model be able to learn from those annotations?

Could you please give me some advice on these questions?

And are you going to provide something similar to the NER annotation flowchart for textcat in the near future? It is very useful for beginners like me! :blush:

Thanks a lot!

Hi Maria,

Glad you're finding the flowchart helpful. I hope we can get you up and running quickly :).

When creating your model, you could get one step further by changing nlp.add_pipe(nlp.create_pipe('parser')) to nlp.add_pipe(nlp.create_pipe('ner')). I think that might have been what you meant to write? "parser" is the name of the syntactic dependency parser.
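
So the snippet would look something like this (same spaCy v2 calls as yours, just with the "ner" component added):

    import spacy

    nlp = spacy.blank('en')
    nlp.add_pipe(nlp.create_pipe('tagger'))
    # 'ner' instead of 'parser', so recipes like ner.teach can find the component
    nlp.add_pipe(nlp.create_pipe('ner'))
    nlp.begin_training()
    nlp.to_disk('blank_model')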

Adding the "ner" component would solve the immediate error you're receiving, but you'd run into a different problem. As you've mentioned, defining matcher patterns for your task will probably be quite difficult. The ner.teach recipe doesn't work that well if the problem is difficult and you're starting from a blank model. What the recipe does is try to quickly learn what you're trying to do, and then use its guesses to choose the examples for annotation. But if it starts off not knowing anything, its guesses aren't so good.

I think you should start off with the ner.manual recipe, and just spend some time annotating an unbiased sample you can use for evaluation. This will also give you a good feel for the task, so you can write more accurate rules. You probably want to make some notes as you go.
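
For example, something along these lines (the dataset name and input file are just placeholders):

    prodigy ner.manual energy_eval en_core_web_sm your_texts.jsonl --label ENERGY

ner.manual only uses the spaCy model for tokenization, so en_core_web_sm is fine here.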

When you do want to switch over to creating your automated solution, I would first focus on a rule-based approach, probably using the syntactic dependency parse. As you've noticed, it's quite difficult to capture the phrases you're interested in as sequences of words with parts of speech. But many of the things you're interested in might be "constituents" in the parse tree: https://explosion.ai/demos/displacy?text=The%20report%20cited%20embodied%20energy%20of%20the%20foundary%20products%20as%20a%20major%20factor%20in%20temperature-specific%20performance.&model=en_core_web_sm&cpu=0&cph=0

The parse tree is pretty easy to use: https://spacy.io/usage/linguistic-features#dependency-parse . The most useful attribute for you will probably be word.subtree. In the displaCy example I linked above, energy.subtree would give you the whole phrase you want.
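
For instance, something like this should print the phrase around "energy" (using roughly the sentence from the displaCy link above):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp("The report cited embodied energy of the foundry products "
              "as a major factor in temperature-specific performance.")

    for token in doc:
        if token.lower_ == 'energy':
            # token.subtree yields the token plus all of its syntactic
            # descendants, in document order
            subtree = list(token.subtree)
            print(doc[subtree[0].i : subtree[-1].i + 1].text)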

Hopefully what you'll be able to do is have some process for identifying the head word of your phrase, like "energy", and then a rule-based process that uses the dependency parse to expand out the full phrase. You could eventually add some context sensitivity as well, by using a tagging approach to figure out whether a given usage of "energy" is really one you're interested in.
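
A rough sketch of that idea, with a hypothetical term list (you'd want to build up a proper list of head words from your data):

    import spacy

    nlp = spacy.load('en_core_web_sm')

    # Hypothetical head words -- replace with the terms from your list
    ENERGY_HEADS = {'energy', 'bioenergy', 'biofuel'}

    def energy_spans(doc):
        """Expand each matched head word into the phrase below it in the parse."""
        for token in doc:
            if token.lower_ in ENERGY_HEADS:
                subtree = list(token.subtree)
                yield doc[subtree[0].i : subtree[-1].i + 1]

    doc = nlp("The project promotes sustainable bioenergy from forest residues.")
    for span in energy_spans(doc):
        print(span.text, span.start_char, span.end_char)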

If you first annotate your sample of texts, you'll be able to evaluate your rule-based script as you go. This should give you a much more reliable way to steadily improve performance. Eventually incorporating more task-specific machine learning might improve your accuracy, but you'll get a much better start if you can use the dependency parse at first.
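
Once you've exported your manual annotations with db-out, the comparison could be as simple as something like this (the file name is a placeholder, and rule_based_spans() stands in for whatever rules you end up writing):

    import json

    def rule_based_spans(text):
        # Placeholder: plug in your dependency-parse rules here and
        # return (start_char, end_char, label) tuples
        return []

    gold, pred = set(), set()
    # Hypothetical export: prodigy db-out energy_eval > energy_eval.jsonl
    with open('energy_eval.jsonl') as f:
        for i, line in enumerate(f):
            eg = json.loads(line)
            if eg.get('answer', 'accept') != 'accept':
                continue  # skip anything rejected or ignored in the UI
            for s in eg.get('spans', []):
                gold.add((i, s['start'], s['end'], s['label']))
            for start, end, label in rule_based_spans(eg['text']):
                pred.add((i, start, end, label))

    correct = len(gold & pred)
    print('precision:', correct / len(pred) if pred else 0.0)
    print('recall:', correct / len(gold) if gold else 0.0)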

I will give the parse tree a try. Thanks for your reply!