ner_defect_labels is my manually annotated dataset, which I used to train a model that I then saved to tmp_model3. The error is as follows:
Using 1 label(s): EQUIPMENT
The model you're using isn't setting sentence boundaries (e.g. via
the parser or sentencizer). This means that incoming examples won't be split
into sentences.
When I open http://localhost:8080 to correct the annotations, it says there's nothing to annotate.
Could you please help me understand what's going wrong?
When you look at the ner.correct recipe docs, you'll notice that there is a --unsegmented flag.
By default, this recipe splits the incoming text into sentences on your behalf, because that makes the examples easier to annotate. However, you need to pass it an nlp model that is capable of doing that.
Just to check, how did you construct your model? The ./tmp_model3/model-best one? If it was trained using prodigy train you may want to make sure that you use en_core_web_sm as a starting point. That way, you should have all the components required to split sentences.
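If you want to check this yourself, here is a minimal sketch, assuming spaCy v3 (where Doc.has_annotation is available). The helper name sets_sentence_boundaries and the sample text are just for illustration. It reports whether a loaded pipeline actually assigns sentence boundaries.

```python
def sets_sentence_boundaries(nlp, sample="This is one sentence. This is another."):
    """Return True if the pipeline assigns sentence starts to the sample text.

    `nlp` is expected to be a loaded spaCy pipeline (spaCy v3 assumed).
    """
    doc = nlp(sample)
    # SENT_START is set by the parser, senter or sentencizer components.
    return doc.has_annotation("SENT_START")


def describe_pipeline(model_path="./tmp_model3/model-best"):
    """Load a pipeline and print its components and sentence-splitting status."""
    import spacy  # imported here so the helper above has no hard dependency

    nlp = spacy.load(model_path)
    print("Components:", nlp.pipe_names)
    print("Sets sentence boundaries:", sets_sentence_boundaries(nlp))
```

Calling describe_pipeline() on your trained model should show only the components it was trained with. If there is no parser, senter or sentencizer in the list, that would explain the warning.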
Alternatively, you may also:
Choose to run with the --unsegmented flag. This way there is no need for components that can split sentences.
Choose to split the data into sentences beforehand using en_core_web_sm. This can be done in a separate Python script that stores the sentences in a new file called sentences.jsonl, which you then pass to ner.correct.
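For the last option, a rough sketch of such a script could look like the one below. It assumes your raw data is a JSONL file with a "text" key on each line. The input name data.jsonl and the function names are placeholders I made up, so adjust them to your setup.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def write_jsonl(path, records):
    """Write an iterable of dicts to a JSONL file, one per line."""
    with Path(path).open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def split_records(nlp, records):
    """Yield one {"text": ...} task per sentence found by the pipeline."""
    for record in records:
        doc = nlp(record["text"])
        for sent in doc.sents:
            text = sent.text.strip()
            if text:  # skip whitespace-only sentences
                yield {"text": text}


def main(source="data.jsonl", target="sentences.jsonl"):
    """Split every text in `source` into sentences and save them to `target`."""
    import spacy  # imported here so the helpers above work without spaCy

    nlp = spacy.load("en_core_web_sm")
    write_jsonl(target, split_records(nlp, read_jsonl(source)))
```

After running main(), you can point ner.correct at sentences.jsonl. Since the text is already split at that point, you would probably also want to add --unsegmented so the recipe doesn't try to split it again.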
I implemented ner.correct and retrained my model. The accuracy improved. However, while I was using ner.correct I noticed that my base model en_core_web_sm incorrectly split the sentences. For example, a sentence that should have been a single input for annotation was split 3 times and came up in parts when I used ner.correct.
I am worried that this has incorrectly affected the model. Is there a modification I need to do to fix this?
Instead of relying on the base model for sentence splitting, is it generally better to use the --unsegmented flag?
That shouldn't be needed. The model that you use while annotating in ner.manual provides the tokenisation and the sentence-splitting capabilities. The tokenisers are the same across all en_core_* models.
The only slight difference is that the sentences may be split somewhat differently. I'd be surprised if that has a significant negative impact on your final model though, mainly because you're still training on examples that you've accepted.