ner.correct text split across multiple screens in Prodigy GUI

stefan.bartell · January 19, 2023, 7:45pm

I trained a model, and I'm using ner.correct on new text to correct how it tags dosages and non-dosages (expressions that look like dosages) in comments.

The new text is a CSV file with one comment per line, and it looks like this:

text
I'm down six pounds since starting and pretty much eating whatever I want. ...
Originally Posted by Scarzmeanwhile he can help out in the breast feeding department...

Here is the command I'm using:

python3 -m prodigy ner.correct identify_dosages_non_dosages_validate_data_correct_SB ./identify_dosages_non_dosages_validate_data_model_SB/model-best ./../data/bb.corpus.deduplicated.cleaned.after_ner_train_merged_subtracted_dev.csv --label dosage,non_dosage

The problem is that when I enter the Prodigy GUI to correct the model's annotations, lines of text from the CSV file are being split across multiple screens (which I accept separately) as if they're multiple comments, when in fact they are only one comment. This may be a problem when I'm trying to analyze the annotations. Does anyone know how to stop the GUI from splitting comments across multiple screens?

ryanwesslen · January 19, 2023, 8:39pm

hi @stefan.bartell!

I suspect it is ner.correct's default sentence segmentation.

You can turn this off by adding --unsegmented:

python -m prodigy ner.correct gold_ner en_core_web_sm ./news_headlines.jsonl --label PERSON --unsegmented

It's important to know why Prodigy was designed this way for model-in-the-loop recipes (e.g., ner.correct and ner.teach). The ner docs explain:

You probably noticed that most of the examples on this page show short texts like sentences or paragraphs. For NER annotation, there’s often no benefit in annotating long documents at once, especially if you’re planning on training a model on the data. Annotating with a model in the loop is also much faster if the texts aren’t too long, which is why recipes like ner.teach and ner.correct split sentences by default. NER model implementations also typically use a narrow contextual window of a few tokens on either side. If a human annotator can’t make a decision based on the local context, the model will struggle to learn from the data.

It makes sense to turn this off if your data isn't written in complete sentences, because the sentence segmenter won't work well on that kind of input.
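To see why one comment can show up as several screens, here is a rough sketch of the idea. It is not Prodigy's or spaCy's actual segmenter: `naive_split` is a hypothetical helper that splits on sentence-final punctuation, and the comment text is made up. The point is only that each resulting piece would become its own annotation task.

```python
import re


def naive_split(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    Illustration only: a hypothetical stand-in for a real sentence
    segmenter, not what Prodigy or spaCy actually runs.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Made-up comment, standing in for one row of the CSV.
comment = "Took 5 mg. Felt fine. No change!"

tasks = naive_split(comment)
for number, task in enumerate(tasks, start=1):
    print(f"screen {number}: {task}")
```

With segmentation on, a single row like this would be presented as three separate tasks. With --unsegmented, the whole comment stays together as one task.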

If there is any natural structure in the documents (e.g., in legal documents they have subparts to paragraphs with (a), (2), or (iii)), you could correct the existing sentence segmenter and retrain it so it works better next time. The main benefit is that you'll likely have more efficient annotation sessions.
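If you keep segmentation on but still want to analyze annotations per comment, one option is to convert the CSV to JSONL first and store the row number under "meta". This is a minimal sketch with assumptions: `csv_to_jsonl_lines` is a hypothetical helper, the column is assumed to be named text as in your sample, and I'm assuming the "meta" value is carried over to each split task, which is worth checking on a small sample before relying on it.

```python
import csv
import io
import json


def csv_to_jsonl_lines(csv_file, text_column="text"):
    """Yield one JSON line per CSV row, with the row number under "meta".

    Hypothetical helper: the "row" key name is an assumption, chosen
    only so that each task can be traced back to its source row.
    """
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=1):
        task = {"text": row[text_column], "meta": {"row": row_number}}
        yield json.dumps(task, ensure_ascii=False)


# Made-up two-row CSV, standing in for the real file.
sample = io.StringIO(
    'text\n"First comment. It has two sentences."\nSecond comment\n'
)

lines = list(csv_to_jsonl_lines(sample))
for line in lines:
    print(line)
```

You would write these lines to a .jsonl file and pass that file as the source instead of the CSV. After annotating, you could then group the saved examples by the row number in "meta".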

Let me know if this helps!

stefan.bartell · January 19, 2023, 9:01pm

Thanks - Looks like --unsegmented did the trick, and your explanation was helpful too.
