Improving the senter's performance

Hi,

I am training some models on a UD treebank and adding a NER component trained on data I created with Prodigy. Benchmarks are good overall, except for sentence boundary recognition, which is around 0.72 with the transformer model. I think this may be because the UD treebank has no punctuation; strangely, all punctuation marks were removed from this UD corpus.

So I’m considering creating a supplementary corpus with sentence boundary and dependency annotations that include `punct`. Is this something the Prodigy/spaCy team would recommend, or are there other ways to improve the model's sentence recognition?

By the way, why is the senter disabled in most spaCy models?

Thanks,

Jacobo

hi @jcbmyrstn!

Thanks for your questions!

For your questions on core spaCy (e.g., why the senter is disabled in most spaCy pipelines), it may be better to post on the spaCy GitHub discussions board. That's where the spaCy core development team answers spaCy-specific questions. I can speculate, but the core team can give you the exact reasoning. As the team has grown, this forum has focused on Prodigy, while the GitHub discussions cover core spaCy.
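One small note I can add in the meantime: in the trained English pipelines the senter is usually shipped but disabled by default, because the parser already sets sentence boundaries. If you want to see what the senter does on its own, a minimal sketch like this should work (assuming a pipeline such as en_core_web_sm that includes a disabled senter):

```python
import spacy

# Load the pipeline without the parser so sentence boundaries
# don't come from the dependency parses.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])

# The senter ships with the pipeline but is disabled by default;
# enabling it lets it set the sentence boundaries instead.
nlp.enable_pipe("senter")

doc = nlp("This text has two sentences. The senter should find the split.")
print([sent.text for sent in doc.sents])
```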

They also have several related posts on improving sentence models (example 1 or example 2) that may help you.

In the meantime, have you considered using Prodigy's sentence recipes to train a new sentence segmentation model?

I would start with the sent.correct recipe, which uses the label S to mark tokens that start a sentence. You can begin either from a blank pipeline (blank:en) or from an existing model like en_core_web_sm and create annotations from there. You can then run prodigy train, and after a few hundred annotated sentences you may get a noticeably better model pretty quickly; see the sketch below.
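Roughly, the workflow would be something like `prodigy sent.correct your_dataset en_core_web_sm ./texts.jsonl` to annotate, followed by `prodigy train ./sent_model --senter your_dataset` to train. The dataset name and paths here are just placeholders, so check `prodigy sent.correct --help` and `prodigy train --help` for the exact arguments on your version. Once training finishes, a quick way to sanity-check the new senter is a small sketch like this (it assumes prodigy train wrote its best pipeline to the hypothetical ./sent_model/model-best directory):

```python
import spacy

# Hypothetical output location: `prodigy train` (which wraps
# `spacy train`) saves the best pipeline under <output_dir>/model-best.
nlp = spacy.load("./sent_model/model-best")

# Text without punctuation, similar to the UD corpus described above.
text = "the first sentence ends here the second one starts without any marker"
doc = nlp(text)
for sent in doc.sents:
    print(repr(sent.text))
```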

It's also interesting that the punctuation was stripped from the corpus. While not a spaCy/Prodigy solution, there are models that can restore punctuation, such as punctuator. I haven't used it, but it has a cool demo you can try out.

I hope this helps and let me know if you have further questions!