Yes, this definitely looks suspicious! Could you show some examples of the data you collected? And how many instances of the MATERIAL terms from your patterns were in your data? Did they come up a lot? 200 examples is still a pretty small set, so it's possible that you simply didn't use enough data.
One quick note on this: When you tried out the model in Prodigy, did you use ner.teach? Because what you see here can potentially be very misleading: Prodigy will get all possible analyses for the sentence and present you with the examples the model is most uncertain about, i.e. the ones with predictions closest to 0.5. So those aren't necessarily the entities with the highest scores or the most "correct" ones.
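Just to illustrate the idea (this is not Prodigy's internal code, only a sketch of the sorting principle with made-up candidates and scores):

```python
# Illustration only: uncertainty sampling surfaces the candidates whose
# scores are closest to 0.5, not the highest-scoring ones.
candidates = [
    ("graphene", 0.92),  # model fairly confident this is MATERIAL
    ("steel", 0.51),     # model very unsure -> shown first in ner.teach
    ("banana", 0.08),    # model fairly confident this is *not* MATERIAL
]

for text, score in sorted(candidates, key=lambda c: abs(c[1] - 0.5)):
    print(text, score)
```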
If you want to see how the model performs "in real life", it'd make more sense to load it with spaCy, process a bunch of (unseen) text and look at the MATERIAL entities in doc.ents. Those are the ones that the model will actually predict.
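For example, something like this (the model path and the texts are just placeholders for your trained model and your own unseen data):

```python
import spacy

# Load the model you trained (adjust the path to wherever it was saved)
nlp = spacy.load("/path/to/your/trained-model")

texts = [
    "The frame is made of carbon fiber and aluminium.",
    "We coated the sample with a thin layer of titanium dioxide.",
]

for doc in nlp.pipe(texts):
    # doc.ents contains only the entities the model actually predicts
    print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "MATERIAL"])
```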
The latest screenshot of training you’ve posted still shows a very small dataset. Is that the correct run? It shows only 206 training examples and 14 examples used for evaluation, which is probably just not enough data to train with.
Another problem is that it looks like you’re training on top of a model that already gets 13/14 correct. It’s better to start with a blank model each time you run batch train. Finally, try lowering the batch size: with very little data, you usually want a small batch size.
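One simple way to get a blank base model is to create it with spaCy and save it to disk, and then pass that directory to batch train instead of the already trained model (the output path here is just an example):

```python
import spacy

# Create a blank English pipeline and save it to disk so it can be
# used as the base model for batch training
nlp = spacy.blank("en")
nlp.to_disk("./blank_en_model")
```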
I think the reason the number is lower is that a) you might have ignored some examples and b) before training, Prodigy will combine all annotations on the same sentence into one training example. So if you're using ner.teach and accept / reject multiple entities on the same text, all of those annotations become one training example later on.
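A simplified sketch of that merging idea (not Prodigy's actual code, just to show why the counts end up lower):

```python
from collections import defaultdict

# Multiple accept/reject decisions on the same sentence end up as a
# single training example with all annotated spans attached.
annotations = [
    {"text": "The beam was made of steel.", "span": ("steel", "MATERIAL"), "answer": "accept"},
    {"text": "The beam was made of steel.", "span": ("beam", "MATERIAL"), "answer": "reject"},
    {"text": "We used a copper wire.", "span": ("copper", "MATERIAL"), "answer": "accept"},
]

merged = defaultdict(list)
for eg in annotations:
    merged[eg["text"]].append((eg["span"], eg["answer"]))

# 3 annotations -> 2 training examples after merging
print(len(annotations), "annotations ->", len(merged), "training examples")
```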
I think it might be a good idea to include a message about this in the output of the training recipes. Even just something like "Merging X examples..." or "Merged X annotations into Y training examples".