Implementing ner.correct says the model you are using isn't setting sentence boundaries

tahia · July 10, 2023, 9:29am

Hi there,

I'm facing some issues in using ner.correct. My code is:

prodigy ner.correct defect_data_correct ./tmp_model3/model-best sample.jsonl --label EQUIPMENT --exclude ner_defect_labels

ner_defect_labels is my manually annotated dataset which I used to train a model and then save it to tmp_model3. The error is as follows:

Using 1 label(s): EQUIPMENT
The model you're using isn't setting sentence boundaries (e.g. via
the parser or sentencizer). This means that incoming examples won't be split
into sentences.

When I open http://localhost:8080 to correct the annotations, it says there's nothing to annotate.

Could you please help me understand what's going wrong?

Sincerely,
Tahia

koaning · July 10, 2023, 9:46am

When you look at the ner.correct recipe docs, you'll notice that there is a --unsegmented flag.

By default, this recipe splits the text into sentences on your behalf because that makes the examples easier to annotate. However, you will need to pass it an nlp model that is capable of doing that.

Just to check, how did you construct your model? The ./tmp_model3/model-best one? If it was trained using prodigy train you may want to make sure that you use en_core_web_sm as a starting point. That way, you should have all the components required to split sentences.
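To check whether a trained pipeline is able to set sentence boundaries at all, you could inspect it in Python. This is only a minimal sketch: the helper names sets_sentence_boundaries and check_model are made up for illustration, and it assumes spaCy v3, where Doc.has_annotation is available.

```python
def sets_sentence_boundaries(nlp, sample="This is one sentence. This is another."):
    """Return True if the pipeline marks sentence starts on a sample text."""
    doc = nlp(sample)
    # SENT_START is only annotated when a parser, senter or sentencizer ran.
    return doc.has_annotation("SENT_START")


def check_model(path):
    """Load a pipeline from disk (hypothetical helper) and report on it."""
    import spacy  # imported here so the helper above has no hard dependency

    nlp = spacy.load(path)
    print("Components:", nlp.pipe_names)
    return sets_sentence_boundaries(nlp)
```

Calling check_model("./tmp_model3/model-best") should return False for a pipeline that only contains an NER component, which would match the warning shown above.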

Alternatively, you may also:

Choose to run with the --unsegmented flag. This way there is no need for components that can split sentences.
Choose to split the data into sentences beforehand using en_core_web_sm. This can be done in a separate Python script and you can store the sentences into a new file called sentences.jsonl which you then pass to ner.correct.
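The second option could look roughly like the script below. Treat it as a sketch under a few assumptions: every line of sample.jsonl is a JSON object with a "text" key, en_core_web_sm is installed, and the function names (split_into_sentences, read_jsonl, write_jsonl, main) are made up for this example.

```python
import json


def split_into_sentences(records, nlp):
    """Yield one {"text": ...} task per sentence found by the pipeline."""
    for record in records:
        doc = nlp(record["text"])
        for sent in doc.sents:
            text = sent.text.strip()
            if text:  # skip whitespace-only sentences
                yield {"text": text}


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def main(in_path="sample.jsonl", out_path="sentences.jsonl"):
    import spacy  # imported here so the helpers above work without spaCy

    nlp = spacy.load("en_core_web_sm")
    write_jsonl(out_path, split_into_sentences(read_jsonl(in_path), nlp))
```

After calling main(), you could pass sentences.jsonl to ner.correct together with --unsegmented, so the recipe doesn't try to split the text again.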

tahia · July 10, 2023, 9:56am

Thank you so much for your reply! This is how I trained the model:

prodigy train ./tmp_model3 --ner ner_defect_labels --eval-split 0.3

I didn't use en_core_web_sm. Would the code look like this if I did?

prodigy train ./tmp_model3 --ner ner_defect_labels --eval-split 0.3 --base-model en_core_web_sm

Sincerely,
Tahia

koaning · July 10, 2023, 10:04am

Yep. You do need to make sure this model is installed beforehand, but you should be able to simply download it via:

python -m spacy download en_core_web_sm

tahia · July 10, 2023, 1:50pm

Thank you! ner.correct works now

tahia · July 17, 2023, 11:21am

Hi @koaning ,

I implemented ner.correct and retrained my model. The accuracy improved. However, while I was using ner.correct I noticed that my base model en_core_web_sm incorrectly split the sentences. For example, a sentence that should have been a single input for annotation was split 3 times and came up in parts when I used ner.correct.

I am worried that this has incorrectly affected the model. Is there a modification I need to do to fix this?

Is it generally better to use the --unsegmented flag instead of relying on the base model?

koaning · July 17, 2023, 4:00pm

You could also try other, maybe more performant, base models.

Do you have the same issues with en_core_web_md and en_core_web_lg?

tahia · July 21, 2023, 1:03pm

Hi Koaning,

I have the same issue with en_core_web_lg. These are the commands I have implemented so far:

prodigy ner.manual ner_defect_labels en_core_web_lg sample.jsonl --label EQUIPMENT --patterns patterns.jsonl

prodigy db-out ner_defect_labels > annotations.jsonl

prodigy train ./first_train --ner ner_defect_labels --eval-split 0.3 --base-model en_core_web_lg --training.max_steps=3000 --training.optimizer.learn_rate=0.001

prodigy ner.correct first_train_correct ./first_train/model-best annotations.jsonl --label EQUIPMENT --exclude ner_defect_labels

In order to use ner.correct with en_core_web_md, do I have to repeat from ner.manual with en_core_web_md as the base model?

I'd be grateful for your help in understanding this. Please feel free to let me know if I should make a separate thread with my question.

Sincerely,
Tahia

koaning · July 24, 2023, 8:32am

That shouldn't be needed. The model that you use while annotating in ner.manual provides the tokenisation and the sentence-splitting capabilities. The tokenisers are the same across all en_core_* models.

If you're eager to learn more, you may find this section useful from the spaCy documentation.

The only slight difference is that the sentences may be split somewhat differently. I'd be surprised if that has a significant negative impact on your final model though, mainly because you're still training on examples that you've accepted.

Does this help?
