ner.correct says the model you're using isn't setting sentence boundaries

Hi there,

I'm facing some issues in using ner.correct. My code is:

prodigy ner.correct defect_data_correct ./tmp_model3/model-best sample.jsonl --label EQUIPMENT --exclude ner_defect_labels

ner_defect_labels is my manually annotated dataset, which I used to train a model that I saved to tmp_model3. The error is as follows:

Using 1 label(s): EQUIPMENT
⚠ The model you're using isn't setting sentence boundaries (e.g. via
the parser or sentencizer). This means that incoming examples won't be split
into sentences.

When I open http://localhost:8080 to correct the annotations, it says there's nothing to annotate.

Could you please help me understand what's going wrong?


When you look at the ner.correct recipe docs, you'll notice that there is an --unsegmented flag.

By default, this recipe splits sentences on your behalf because it makes examples easier to annotate. However, you will need to pass it an nlp model that is capable of doing that.

Just to check, how did you construct your model, the ./tmp_model3/model-best one? If it was trained using prodigy train, you may want to make sure that you use en_core_web_sm as a starting point. That way, you should have all the components required to split sentences.

Alternatively, you may also:

  • Run with the --unsegmented flag. This way there is no need for components that can split sentences.
  • Split the data into sentences beforehand using en_core_web_sm. This can be done in a separate Python script: store the sentences in a new file called sentences.jsonl, which you then pass to ner.correct.
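That second option can be sketched as a short script. This is only a sketch under a couple of assumptions: sample.jsonl contains {"text": ...} records, and en_core_web_sm has been downloaded (if it hasn't, the sketch falls back to a blank pipeline with a rule-based sentencizer):

```python
import json
import spacy

# Load a pipeline that can set sentence boundaries. en_core_web_sm
# includes a parser, which spaCy uses for sentence segmentation.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Fallback if the model isn't downloaded: a blank English pipeline
    # with a rule-based sentencizer that splits on punctuation.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")

def split_into_sentences(in_path, out_path):
    """Read JSONL records with a "text" key and write one record
    per sentence to a new JSONL file."""
    with open(in_path, encoding="utf8") as fin, \
         open(out_path, "w", encoding="utf8") as fout:
        for line in fin:
            record = json.loads(line)
            doc = nlp(record["text"])
            for sent in doc.sents:
                fout.write(json.dumps({"text": sent.text}) + "\n")
```

You could then run split_into_sentences("sample.jsonl", "sentences.jsonl") and pass sentences.jsonl to ner.correct, optionally with --unsegmented since the examples are already split.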

Thank you so much for your reply! This is how I trained the model:

prodigy train ./tmp_model3 --ner ner_defect_labels --eval-split 0.3

I didn't use en_core_web_sm. Would the code look like this if I did?

prodigy train ./tmp_model3 --ner ner_defect_labels --eval-split 0.3 --base-model en_core_web_sm


Yep. You do want to make sure this model is available beforehand, though. You should be able to download it via:

python -m spacy download en_core_web_sm

Thank you! ner.correct works now


Hi @koaning ,

I implemented ner.correct and retrained my model, and the accuracy improved. However, while using ner.correct I noticed that my base model en_core_web_sm split some sentences incorrectly. For example, a sentence that should have been a single annotation example was split into three parts when it came up in ner.correct.

I am worried that this has incorrectly affected the model. Is there a modification I need to do to fix this?

Instead of using the base model, is it generally better to pass the --unsegmented flag?

You could also try other, maybe more performant, base models.

Do you have the same issues with en_core_web_md and en_core_web_lg?
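One quick way to compare is to load each model and see how it splits one of the problematic lines. A minimal sketch, assuming the models have been downloaded (missing ones are skipped) and using a made-up example text:

```python
import spacy

def sentence_splits(text, model_names):
    """Return {model_name: [sentence, ...]} for each model that is
    installed; models that aren't downloaded are skipped."""
    splits = {}
    for name in model_names:
        try:
            nlp = spacy.load(name)
        except OSError:
            continue  # not downloaded: python -m spacy download <name>
        splits[name] = [sent.text for sent in nlp(text).sents]
    return splits

# Hypothetical example text; substitute a line from your own data that
# en_core_web_sm split incorrectly.
TEXT = "Pump no. 3 tripped on high vibration. Bearing replaced."
for name, sents in sentence_splits(
    TEXT, ["en_core_web_sm", "en_core_web_md", "en_core_web_lg"]
).items():
    print(name, sents)
```

If the md or lg model splits your examples more sensibly, you can use it as the base model instead.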


Hi Koaning,

I have the same issue with en_core_web_lg. These are the commands I have implemented so far:

prodigy ner.manual ner_defect_labels en_core_web_lg sample.jsonl --label EQUIPMENT --patterns patterns.jsonl
prodigy db-out ner_defect_labels > annotations.jsonl
prodigy train ./first_train --ner ner_defect_labels --eval-split 0.3 --base-model en_core_web_lg --training.max_steps=3000 --training.optimizer.learn_rate=0.001
prodigy ner.correct first_train_correct ./first_train/model-best annotations.jsonl --label EQUIPMENT --exclude ner_defect_labels

In order to use ner.correct with en_core_web_md, do I have to repeat from ner.manual with en_core_web_md as the base model?

I'd be grateful for your help in understanding this. Please feel free to let me know if I should make a separate thread with my question.


That shouldn't be needed. The model that you use while annotating in ner.manual provides the tokenisation and sentence-splitting capabilities, and the tokenisers are the same across all en_core_* models.

If you're eager to learn more, you may find this section useful from the spaCy documentation.

The only slight difference is that the sentences may be split somewhat differently. I'd be surprised if that has a significant negative impact on your final model though, mainly because you're still training on examples that you've accepted.
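You can check the shared tokenisation directly. As far as I know, even a blank English pipeline carries spaCy's rule-based English tokeniser rules, which are also what the en_core_web_* models ship with, so this sketch doesn't need any model downloaded:

```python
import spacy

# A blank English pipeline uses spaCy's rule-based English tokeniser,
# the same rules the en_core_web_* models ship with, so token
# boundaries don't depend on which model you annotated with.
nlp = spacy.blank("en")
doc = nlp("The pump's bearing (no. 3) failed.")
print([t.text for t in doc])
```

Only the sentence boundaries come from the trained components (parser or senter), which is why those can differ between models while the tokens stay identical.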

Does this help?