Improving the senter's performance

Hi,

I am training some models on a UD treebank and adding a NER component trained on data I created with Prodigy. Benchmarks are good overall, except for sentence boundary recognition, which is around 0.72 with the transformer model. I think this may be because the UD treebank has no punctuation; strangely, all punctuation marks were removed from this UD corpus.

So I’m considering creating a supplementary corpus with sentence boundary and dependency annotations that include `punct`. Is this something the Prodigy/spaCy team would recommend, or are there other ways to improve the model's sentence recognition?

By the way, why is the senter disabled in most spaCy models?

Thanks,

Jacobo

hi @jcbmyrstn!

Thanks for your questions!

For your questions on core spaCy (e.g., why the senter is disabled in most spaCy pipelines), it may be better to post on the spaCy GitHub discussions board. That's where the spaCy core development team answers spaCy-specific questions. I can speculate, but the core team can give you the exact reasoning. As the team has grown, this forum has focused on Prodigy, while the GitHub discussions cover core spaCy.
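One small note I can add in the meantime: in the trained English pipelines the senter is usually shipped but disabled by default, because the parser already sets sentence boundaries. If you want to see what the senter does on its own, a minimal sketch like this should work (assuming a pipeline such as en_core_web_sm that includes a disabled senter):

```python
import spacy

# Load the pipeline without the parser so sentence boundaries
# don't come from the dependency parses.
nlp = spacy.load("en_core_web_sm", exclude=["parser"])

# The senter ships with the pipeline but is disabled by default;
# enabling it lets it set the sentence boundaries instead.
nlp.enable_pipe("senter")

doc = nlp("This text has two sentences. The senter should find the split.")
print([sent.text for sent in doc.sents])
```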

They also have several related posts on improving sentence models (example 1 or example 2) that may help you.

In the meantime, have you considered using Prodigy's sentence recipes to train a new sentence segmentation model?

I would start with the sent.correct recipe, which uses the label S to mark tokens that start a sentence. You can begin either from a blank pipeline (blank:en) or from an existing model like en_core_web_sm and create annotations from there. You can then run prodigy train, and after a few hundred annotated sentences you may get a noticeably better model pretty quickly; see the sketch below.
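Roughly, the workflow would be something like `prodigy sent.correct your_dataset en_core_web_sm ./texts.jsonl` to annotate, followed by `prodigy train ./sent_model --senter your_dataset` to train. The dataset name and paths here are just placeholders, so check `prodigy sent.correct --help` and `prodigy train --help` for the exact arguments on your version. Once training finishes, a quick way to sanity-check the new senter is a small sketch like this (it assumes prodigy train wrote its best pipeline to the hypothetical ./sent_model/model-best directory):

```python
import spacy

# Hypothetical output location: `prodigy train` (which wraps
# `spacy train`) saves the best pipeline under <output_dir>/model-best.
nlp = spacy.load("./sent_model/model-best")

# Text without punctuation, similar to the UD corpus described above.
text = "the first sentence ends here the second one starts without any marker"
doc = nlp(text)
for sent in doc.sents:
    print(repr(sent.text))
```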

It's also interesting that the punctuation was stripped from the corpus. While not a spaCy/Prodigy solution, there are models that can restore punctuation, such as punctuator. I haven't used it, but it has a cool demo you can try out.

I hope this helps and let me know if you have further questions!