Question about inconsistent labeling between Prodigy and Jupyter notebook

Hello. I am using ner to train a model on citation data. One thing that confuses me is a recurring inconsistency between what I see in Prodigy and in a Jupyter notebook when applying the same model (obtained through ner.correct).
For example, I start the teach recipe with the label JOURNAL:

python -m prodigy ner.teach citation_2nd_teach-1_binary .\citation_2nd_correct-1_model\model-best .\citation_2nd\To-annotate\4articles.txt --label JOURNAL

The example citation is: "Oscar Bernal et al., Assessing the Contribution of Banks, Insurance, and Other Financial Services to Systemic Risk, 47 J. Banking & Fin. 270, 271 (2014)"

It is highlighted correctly in the Jupyter notebook, but in the Prodigy interface it was broken into three pages and was not recognized as JOURNAL. Do you have any idea how I can fix this so that I can accept and reject suggestions in Prodigy teach? Thank you in advance!



hi @jiebei!

Great to hear from you and glad to see you're making great progress with Prodigy!

Try adding --unsegmented to your Prodigy command (i.e., python -m prodigy ner.teach ... --unsegmented); it may fix it.

"Binary" recipes like ner.teach segment the input into sentences by default. That is why your citation is being broken into separate parts: the sentence segmenter is likely a generic one that isn't perfect, so it can be thrown off by periods that serve other purposes, such as abbreviations. With --unsegmented, Prodigy skips the sentence segmenter and shows the entire document as a single example.
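
To make the effect concrete, here is a minimal sketch of how period-based splitting can cut a citation apart. The naive_split function is a hypothetical toy written for illustration, not the segmenter that Prodigy or spaCy actually uses, so the exact break points in your data may differ. It only shows why an entity such as the journal name can end up straddling two fragments, leaving no single fragment for the model to label.

```python
import re

# The citation from the question, with the journal name we expect to find.
CITATION = (
    "Oscar Bernal et al., Assessing the Contribution of Banks, Insurance, "
    "and Other Financial Services to Systemic Risk, "
    "47 J. Banking & Fin. 270, 271 (2014)"
)
JOURNAL = "J. Banking & Fin."


def naive_split(text):
    """Toy splitter (an assumption for illustration only):
    break after any period that is followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text)


fragments = naive_split(CITATION)

# Show each fragment and whether it contains the full journal name.
for i, fragment in enumerate(fragments, start=1):
    print(i, repr(fragment), JOURNAL in fragment)

# The unsegmented text still contains the journal name in one piece.
print("whole text contains journal:", JOURNAL in CITATION)
```

With this toy rule, the abbreviations "J." and "Fin." are treated as sentence ends, so the journal name is spread over several fragments, while the unsegmented text keeps it intact.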

FYI - you can train a custom segmenter with Prodigy with recipes like sents.manual or sents.correct, which can be very helpful in creating a fine-tuned sentence segmenter for legal texts.

Yes, this solved the issue! Thank you, Ryan!