ner.teach suggests spaces as entities?


I’m currently working on my thesis and quite desperate at the moment since I have another problem with prodigy:
I generated a patterns file with about 140 patterns looking e.g. like this:
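(The screenshot of the patterns is not shown here. For illustration only: Prodigy patterns files are JSONL, one JSON object per line. A sketch of writing a couple of DISEASE patterns in that format, where only the "fieber" pattern is taken from this thread and the second entry is an assumed example:)

```python
import json

# Hypothetical example patterns; only the "fieber" one appears in this thread.
patterns = [
    {"label": "DISEASE", "pattern": [{"lower": "fieber"}]},
    {"label": "DISEASE", "pattern": [{"lower": "grippe"}]},  # assumed example
]

# Write one JSON object per line (JSONL), as Prodigy expects.
with open("disease_patterns.jsonl", "w", encoding="utf-8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```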

Then I used the ner.teach command, and what I get are mostly cases like in the pictures:


It looks like the whitespace characters are marked as a disease which doesn’t make any sense.
Also, it seems that ner.teach randomly marks any tokens as diseases, and quite often e.g. “ich” ( = I in English) even though I keep rejecting those! How can it be that ner.teach keeps suggesting those?

Did anyone have the same problem? Is something wrong with my patterns file? Or might it be the case that my dataset is preprocessed weirdly? (Using {"label":"DISEASE","pattern":[{"orth":"fieber"}]} in the patterns file didn’t help either.)

I could imagine that the problem is that I’m using the German de_core_news_sm model and not an English model…

Any suggestions?
Thanks a lot in advance!

The whitespace seems to be coming from the model, not from the pattern (the score in the lower right indicates it’s using the model). I’ve run into it highlighting whitespace sometimes, but not as often as it seems to be for you. How many times have you rejected highlighted whitespace? One option would be to run ner.batch-train on what you have so far, save that model, and restart ner.teach using your newly trained model. If you’ve rejected enough whitespace, it should have learned not to highlight it.

Good luck on your thesis!

Dear Andy,

Thank you for your answer! I created a model with ner.batch-train and restarted ner.teach. Now I only got a few highlighted whitespaces, but instead punctuation marks were highlighted about half of the time!

What was problematic about creating a model with ner.batch-train was that my first run of ner.teach only gave me a handful of correct entities (I annotated about 600 texts, and of these only about 50 were correct entities!).

How many annotations do you go through when you use ner.teach? Are 600 enough, or do you need more than 1000 / 2000?

So do I get it correctly that I can basically run ner.teach, then ner.batch-train, then again ner.teach and again ner.batch-train and so on?

I ran this command after ner.batch-train with my new model “disease-model1”:
prodigy ner.teach disease_ner disease-model1 /raw_data.csv --label DISEASE --patterns disease_patterns.jsonl

If I then use ner.batch-train again, should I overwrite my “disease-model1” or generate a new model “disease-model2”?
prodigy ner.batch-train disease_ner disease-model1 --output disease-model2 --label DISEASE --eval-split 0.2 --n-iter 6 --batch-size 8

That’s right, you can keep doing the ner.batch-train / ner.teach loop to keep fixing those mislabelings. (It does some updating as you annotate, but apparently not enough to stop asking about whitespace/periods.) You’ll probably want to just overwrite each model unless you’re interested in the change in accuracy with new annotations. And make sure you use de_core_news_sm as the base model every time for ner.batch-train, not the previously saved model.

My NER work has just been on updating existing labels, so I’m not sure how many you’ll need. It also depends on the accuracy you need. Maybe 1000 accepts?

If most of your examples are rejects, you can try shifting the bias parameter upward so the stream favors higher probability labels. See question 73.

Finally, if you’d like the model to also remember how to annotate people, locations, etc, you’ll need to look into some strategies for overcoming the “catastrophic forgetting” problem that comes from it just seeing your disease labels for a while. See the blog post or search the support forum for some ideas. But if you just need disease labels, don’t worry about it.

Thank you so much for your detailed answer! I’ve been lost the whole day trying to figure out why prodigy wouldn’t do as I wanted it to :smiley:

I’m definitely going to try your suggestions tomorrow, especially shifting the bias parameter!

And no, I don’t need to annotate those entities. However, I want to annotate drugs (medication) later on, or rather, I’ve tried training a separate model on this. I thought I’d first figure out how prodigy works with one entity type before combining the two…

Happy to help! Adding a second entity should be easy. Just save the annotations in the same disease_ner db you’ve been using and train one model to recognize both.

I have another question, now concerning the bias parameter:

In the prodigy-readme, all it says about the sorter is this:

from prodigy.models.ner import EntityRecognizer
from prodigy.components.sorters import prefer_uncertain

model = EntityRecognizer(nlp)
stream = prefer_uncertain(model(stream))

But what shall I enter as the stream variable? The readme says it’s “a stream of (score, example) tuples, (usually returned by a model)”. Does that mean I should pass in my trained model here?

And concerning the general workflow with these recipes:
Do I have to save the above bit of code as a .py file?

And how do I then use it in prodigy to shift the bias parameter?

Could you possibly show me a whole example of one of your uses of prefer_uncertain and the bias parameter?

Hi! (And thanks to @andy for the great answers!)

The easiest way to get started with modifying the recipe code is to look at the built-in recipes and tweak them. This will show you how everything fits together. The source of the recipes is shipped with Prodigy, and you can find the location of your installation like this:

python -c "import prodigy; print(prodigy.__file__)"

In recipes/, you’ll then find the recipe source and examples of how the sorter is implemented.
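To build an intuition for what the sorter and its bias do before digging into the recipe source, here is a self-contained sketch. This is not Prodigy’s actual implementation (the real sorter works over a streaming generator, and `prefer_uncertain_mock` is a made-up name); it just illustrates the idea from the discussion above: an uncertainty sorter prefers examples whose model score is close to 0.5, and shifting the bias upward makes it favor higher-probability suggestions instead.

```python
# Illustrative mock of an uncertainty sorter with a bias parameter.
# NOT Prodigy's implementation -- just a sketch of the concept.

def prefer_uncertain_mock(scored_stream, bias=0.0):
    """Yield examples whose scores are closest to 0.5 + bias first."""
    target = 0.5 + bias
    ranked = sorted(scored_stream, key=lambda se: abs(se[0] - target))
    for score, example in ranked:
        yield example

scored = [(0.1, "a"), (0.5, "b"), (0.9, "c")]
print(list(prefer_uncertain_mock(scored)))            # → ['b', 'a', 'c']
print(list(prefer_uncertain_mock(scored, bias=0.3)))  # → ['c', 'b', 'a']
```

With bias=0.0 the most uncertain example ("b", score 0.5) comes first; with bias=0.3 the target moves to 0.8, so the high-scoring "c" is preferred, which is the effect described earlier for cutting down on rejects.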

Alternatively, here are some other ideas you could experiment with: It looks like the \n character is really mostly the problem here, and it’s something that we’ve observed before. When you run ner.teach, Prodigy will look at all possible entity analyses for the text and then suggest the ones that the model is most uncertain about. And for some reason, this seems to be the \n, at least in the beginning.

If it’s not that important for your final model to be able to deal with random newlines at runtime (for example, if you can pre-process the text before analysing it), you could just add a pre-processing step that removes newlines. Since they really throw off the model, it might be more efficient for now to just strip them out, rather than to try and teach the model about them in detail.
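Such a pre-processing step can be as simple as the sketch below (the function name and example text are illustrative). The important part, as discussed later in this thread, is to apply the same function both before annotation and before running the trained model over new text:

```python
import re

def strip_newlines(text):
    """Replace runs of newlines (and surrounding spaces) with a single space."""
    return re.sub(r"\s*\n+\s*", " ", text).strip()

# Example: clean one text before annotation or prediction
raw = "Ich habe seit Tagen\n\nFieber und Husten.\n"
print(strip_newlines(raw))  # → "Ich habe seit Tagen Fieber und Husten."
```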

Additionally, you could also try and start off with the German model and a “blank” entity recognizer (instead of the pre-trained one). This especially makes sense if you’re only interested in your custom entities, and not in any of the other ones that the model predicts by default. I’m not sure if it’ll make a big difference here, but the idea is that a blank model will have no “constraints” that your new entities will have to fit to.

For example, the built-in entity types were trained on tens of thousands of examples, likely more than you will collect for your DISEASE type. The German entity recognizer also tends to struggle more with identifying entities, since it can’t rely so much on capitalisation. In English, a capitalised token is a strong indicator for an entity – in German, it could just be any regular noun. So if the pre-trained entity recognizer is already super confident that “Fieber” is a MISC or an ORG or whatever, it’ll be much more difficult to teach it a new definition.

Here’s how you can export the German model with a blank entity recognizer:

import spacy

nlp = spacy.load('de_core_news_sm')  # load base model
new_ner = nlp.create_pipe('ner')     # create blank entity recognizer
nlp.replace_pipe('ner', new_ner)     # replace old one with new component

# make sure weights of new blank component are initialized
# (this step will likely not be necessary in the future)
nlp.begin_training()

nlp.to_disk('/path/to/model')        # save out the model so Prodigy can load it

Prodigy can also load models from a path (just like spacy.load), so when you run ner.teach, you can now replace the model name with the path to the saved out new model:

prodigy ner.teach disease_ner /path/to/model ...

(Btw, good luck with your thesis! If I remember correctly, there were several other threads on the forum that discussed training biomedical entities with Prodigy, so maybe you’ll also find some inspiration there.)

Thanks for the hint about the recipes, I’m still trying to figure out what prodigy has to offer :wink:

That sounds like a good idea. If I pre-processed my rawdata.csv and removed the newlines, could I then just use that file, rawdata_no_nl.csv, in ner.teach etc.? Or will there be a conflict in my dataset, since I’ve already gone through some part of the data (which wasn’t pre-processed)?

This was a very good hint and helped a lot!!

Thanks, happy to help!

No, that should be fine. As far as Prodigy / spaCy are concerned, those will simply be additional examples. And as long as the annotations are correct and consistent, the model will still learn the right things.

If you use pre-processing, just make sure that you also run the same pre-processing function over the texts you’re processing later on, before you make predictions over them using the new model. Otherwise, you might train a super accurate and great model, but when you run it over new text, it fails miserably, because it has never seen any examples with multiple newlines during training.

I’ll keep that in mind!
Pre-processing my data, i.e. removing the newlines, did indeed help, and I don’t get them suggested as named entities anymore. However, during ner.teach my model still suggests punctuation quite frequently. Has anyone else run into this problem?

Additionally, my model seems to like predicting the first word in a sentence as the entity, no matter which word that is.
When I exported the model and accessed it with python/spacy, it predicted the first words of the example sentences as well!
Do I just have to annotate more examples and reject those cases more often? What else can I do about this problem?

Unfortunately, it seems that Prodigy now treats this pre-processed data as different data, because during ner.manual Prodigy shows me texts which I had already annotated. --exclude my_dataset didn’t help either.
Plus it seems like prodigy just starts from the beginning of my csv file :frowning:
Do I have to start all over again?

Could you share more details about your workflow? For example, what recipes are you running and what’s your annotation strategy? And are you starting with a pre-trained or with a blank model? (Sorry, I think I’m just a little confused because I thought you were using ner.teach, but you also mentioned ner.manual?)

Yes, the --exclude logic compares the exact hashes, so a text without newlines and a text with newlines are treated as different examples. (That’s also super important, because there are a lot of scenarios where you want to explicitly annotate different variations.) But if you know how many examples you’ve already annotated in manual mode, you could just skip the first X examples, or create a new CSV with the first X lines removed?
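Prodigy’s actual hashing is internal, but the effect can be illustrated with plain hashing (the `input_hash` helper below is a hypothetical stand-in): once the newlines are stripped, the input text, and therefore its hash, is simply different, so --exclude can’t match it against the earlier annotations.

```python
import hashlib

def input_hash(text):
    # Stand-in for Prodigy's input hashing -- any deterministic
    # hash of the raw text shows the same effect.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

original = "Ich habe\nFieber."
preprocessed = "Ich habe Fieber."

print(input_hash(original) == input_hash(preprocessed))  # → False
```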

A likely explanation for the complete randomness of the suggestions is that the model hasn’t yet seen enough positive examples to make meaningful predictions. When you start annotating with a model that knows nothing about DISEASE, every token is just as likely to be a DISEASE as any other.

Even more importantly, if you’re using a pre-trained model, you might see another side effect: each token can only be part of one entity, and the existing entity types were trained on thousands of examples, so the model is naturally much more confident about them. This means that the only tokens left over for DISEASE are the most random ones.

Here are some ideas to try:

  • use a blank model (see code above) instead of a pre-trained one if you don’t need any of the other categories – this lets you start with a blank slate, and the suggestions aren’t influenced by the constraints of the other entities
  • if you’ve already collected data with ner.manual, run ner.batch-train to pre-train the blank German model with a few DISEASE annotations. You can then use the new pre-trained model as the base model for ner.teach. This means you get to start off with a model that already knows a tiny bit about DISEASE
  • use the pre-trained model with patterns to bootstrap DISEASE – hopefully, the predictions will be less random

First of all, sorry for the confusion I might have caused. And thank you again for your detailed answers and suggestions! Without this support forum I would have been entirely lost :smiley:

What I find difficult about prodigy is that I didn’t really know how to get the result I’m using prodigy for, namely a model to predict disease and medication entities in user-generated content. Watching your videos and reading the documentation etc., everything seemed so easy. But trying to handle my own project, I was (and still am a bit) unsure about the workflow and which recipes I should use when. That’s why I tried ner.teach, ner.manual, ner.make-gold and ner.batch-train more or less randomly.
I know it’s difficult to generalize to all projects but I think it would have helped me a lot to have a kind of guideline regarding e.g.

  • When shall I use which recipe and how can I make the most of using them in turn?
  • How many annotations are advisable during each run of a recipe?

I did that :wink: In general, pre-processing my data definitely helped a lot to improve my model’s suggestions, even though ner.teach still just seems to ignore my patterns file and suggests random words.
Anyway, what I’ve been doing (after starting from a blank model as you suggested, and some 3000 annotations of ner.teach and ner.manual which didn’t seem to help my model) is focus only on the medication entity and, as explained in Issue 638 Annotation strategy for gold-standard data, use ner.make-gold (500 annotations) and ner.batch-train in turn.

As my model’s suggestions improved, I’m now going to really begin with the disease entity.

Andy suggested just using the same database and overwriting my model. Will that really work? Considering the problems I had at the beginning with my medication model, I fear that the combined model is going to perform badly (I think this is what is meant by the catastrophic forgetting problem).
Would it instead make more sense to use a separate dataset to train this new entity and a disease model and in the end combine the two (in case that even works?)?