Using patterns, and then a new model I've been training (without --base-model specified), I've developed gold-label data for 3,000 documents covering a single NER category.
The end goal is to use this model on discussions indexed from the web, starting with English documents.
Should I be training this model on my gold-label data with one of spaCy's en_core_web models specified as the --base-model?
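Concretely, the training command I'm running looks roughly like this (the dataset name and output path are placeholders for my real ones):

```
# Train on the gold-label data, starting from a spaCy pipeline
# ("my_gold_data" and "./model-with-base" stand in for my actual names)
prodigy train ./model-with-base --ner my_gold_data --base-model en_core_web_sm
```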
I've tried it, and I think the model is improving, but I'm seeing something strange, so I'm wondering whether this is the wrong approach. The strangeness is that when I run ner.correct with the newly base-model-trained pipeline on the rest of my dataset, it warns:
The model you're using isn't setting sentence boundaries (e.g. via the parser or sentencizer). This means that incoming examples won't be split into sentences.
And when I look at the data in Prodigy, sure enough it has started over at the beginning of my dataset, and everything is just the first sentence of each document.
I have to specify --unsegmented if I want to see the full paragraphs again.
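So to see whole documents, I end up running something like this (again, dataset name, source file, and label are placeholders):

```
# Correct the remaining data with the base-model-trained pipeline;
# without --unsegmented, only the first sentence of each doc shows up
prodigy ner.correct my_gold_data ./model-with-base/model-best ./remaining_docs.jsonl --label MY_LABEL --unsegmented
```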
Conversely, when I run ner.correct on a model I trained without any --base-model set, the documents render in full and I don't get the warning.
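In case it helps with diagnosing this, here's a quick check I can run to compare the two pipelines (paths are placeholders); my guess is the difference comes down to whether a parser or sentencizer ends up in the trained pipeline:

```
# List the components of both trained pipelines to see
# whether a parser/sentencizer is actually present
python -c "import spacy; print(spacy.load('./model-with-base/model-best').pipe_names)"
python -c "import spacy; print(spacy.load('./model-no-base/model-best').pipe_names)"
```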
And, if I should be using a --base-model, how does one decide which one to use? I see that en_core_web_trf is more accurate than en_core_web_sm, but what does that typically mean in practice? Is it just training that's slower and more memory-intensive, or will using the trained model also be slower and require more resources?