Trying to train a model with ner.manual and ner.batch-train: results are not perfect

First, the steps you describe all look good, so you’ve been doing this part correctly :blush:

The problem is that 50 examples are nowhere near enough to train a model from scratch. Keep in mind that you’re starting off with a model that knows nothing about the categories you’re training. You also want the model to learn generalised weights based on the examples you give it, so it can detect other, similar entities in context as well.

You also never want your model to just memorize the training data, because this would mean that it could only detect those exact examples. So the training algorithm actively prevents the model from doing this – for example, by shuffling the examples and setting a dropout rate. This is why the model you’ve trained on 50 examples doesn’t perform well on the training data either. It’s tried to generalise based on the examples, but didn’t get enough data to do so successfully.
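To make this a bit more concrete, here’s roughly what such a training loop looks like – a minimal sketch using spaCy v2’s training API, not Prodigy’s actual ner.batch-train code, with the example text and number of iterations made up for illustration:

import random
import spacy
# Rough sketch of an NER training loop (spaCy v2 API). Note the shuffling and
# the dropout rate: both push the model towards generalising rather than
# memorising the training examples.
TRAIN_DATA = [
    ("She works as an engineering manager", {"entities": [(16, 35, "ROLE")]}),
    # ... many more examples
]
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("ROLE")
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)  # never show the examples in a fixed order
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.2, losses=losses)
    print(i, losses)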

If you’re only labelling data with ner.manual, you need a lot of examples – ideally thousands or more.

Because this takes a very long time and is often inefficient, Prodigy also comes with active learning-powered workflows that make it easier to train a model with less data. Instead of labelling everything from scratch, you can work with the model. You can also use some tricks, like working with seed terms and match patterns, to give the model more examples upfront, without having to label every single example by hand. For more details on this, check out this example workflow (including the video that @honnibal already linked above).

Ideas for a solution

In your case, you could, for example, start off with a list of examples of ROLE, like “engineering manager”, “senior developer”, “CEO” etc. You can then create a patterns.jsonl file that looks like this:

{"label": "ROLE", "pattern": [{"lower": "engineering"}, {"lower": "manager"}]}
{"label": "ROLE", "pattern": [{"lower": "ceo"}]}

Each entry in "pattern" describes one token, just like in the patterns for spaCy’s Matcher. You can find more details and examples of this in the PRODIGY_README.html. Ideally, you want a lot of examples for each label, which can all live in the same patterns.jsonl file.
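If you want to double-check your patterns before annotating, you can run them through spaCy’s Matcher directly – a quick sketch, assuming spaCy v2 and the two example patterns above:

import spacy
from spacy.matcher import Matcher
# Sanity-check the token patterns: they use the same syntax as spaCy's Matcher
# (spaCy v2 API shown here).
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("ROLE", None,
            [{"lower": "engineering"}, {"lower": "manager"}],
            [{"lower": "ceo"}])
doc = nlp("We're hiring an Engineering Manager who reports to the CEO.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Engineering Manager, CEO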

Next, you can use the ner.teach recipe with the --patterns argument pointing to your patterns file. This will tell Prodigy to find matches of those terms in your data, and ask you whether they are instances of that entity type. This is especially important for ambiguous entities – for example, “bachelor” can refer to a Bachelor’s degree, but also to a person or the show “The Bachelor” :wink:

prodigy ner.teach skill_dataset en_core_web_sm /var/www/html/role_jd.jsonl --label ROLE --patterns /path/to/patterns.jsonl

As you click accept or reject, the model in the loop will be updated, and will start learning about your new entity type. Once you’ve annotated enough examples, the model will also start suggesting entities based on what it’s learned so far. By default, the suggestions you’ll see are the ones that the model is most uncertain about – i.e. the ones with a prediction closest to 50/50. Those are also the most important ones to annotate, since they will produce the most relevant gradient for training. So don’t worry if they seem a little weird at first – this is good, because your model is still learning and by rejecting the bad suggestions, you’re able to improve it.
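Just to illustrate the “most uncertain first” idea (this isn’t Prodigy’s internal code, and the scores are made up):

# Given candidate entities with scores between 0 and 1, the most useful ones
# to annotate are those closest to 0.5, i.e. where the model is least sure.
candidates = [("senior developer", 0.51), ("bachelor", 0.48), ("the", 0.02), ("CEO", 0.97)]
most_uncertain_first = sorted(candidates, key=lambda item: abs(item[1] - 0.5))
print(most_uncertain_first)
# [('senior developer', 0.51), ('bachelor', 0.48), ('CEO', 0.97), ('the', 0.02)]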

Because you’re only clicking accept and reject, you’ll be able to collect training data much faster. So you can repeat this process for each entity type, until you have a few hundred annotations for each type. You can then start training again and the results should be much better. You can still keep your skill_test dataset with the 50 manual annotations btw, and use it as an evaluation set. ner.manual is actually a really good recipe for creating evaluation sets.

So, in summary:

  1. Create a patterns.jsonl file with examples of each entity type.
  2. Train each entity type with a model in the loop using ner.teach and your patterns, to get over the “cold start” problem.
  3. Repeat for each type until you have enough annotations.
  4. Train a model again and test it.