I trained a model, and I'm using ner.correct on new text to correct how it tags dosages and non-dosages (expressions that look like dosages) in comments.
The new text is a CSV file with one comment per line, and it looks like this:
|I'm down six pounds since starting and pretty much eating whatever I want. ...
|Originally Posted by Scarzmeanwhile he can help out in the breast feeding department...
Here is the command I'm using:
python3 -m prodigy ner.correct identify_dosages_non_dosages_validate_data_correct_SB ./identify_dosages_non_dosages_validate_data_model_SB/model-best ./../data/bb.corpus.deduplicated.cleaned.after_ner_train_merged_subtracted_dev.csv --label dosage,non_dosage
The problem is that when I enter the Prodigy GUI to correct the model's annotations, lines of text from the CSV file are being split across multiple screens (which I accept separately) as if they're multiple comments, when in fact they are only one comment. This may be a problem when I'm trying to analyze the annotations. Does anyone know how to stop the GUI from splitting comments across multiple screens?
I suspect it is ner.correct's default sentence segmentation.
You can turn this off by adding --unsegmented, for example:
python -m prodigy ner.correct gold_ner en_core_web_sm ./news_headlines.jsonl --label PERSON --unsegmented
It's important to know why Prodigy was designed this way for model-in-the-loop recipes (e.g., ner.correct). As the NER docs explain:
You probably noticed that most of the examples on this page show short texts like sentences or paragraphs. For NER annotation, there’s often no benefit in annotating long documents at once, especially if you’re planning on training a model on the data. Annotating with a model in the loop is also much faster if the texts aren’t too long, which is why recipes like ner.correct split sentences by default. NER model implementations also typically use a narrow contextual window of a few tokens on either side. If a human annotator can’t make a decision based on the local context, the model will struggle to learn from the data.
It would make sense to turn this off if your data isn't written in complete sentences, because the sentence segmenter won't work reliably on that kind of text.
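To make it concrete why one comment can show up as several screens, here is a rough sketch of what sentence segmentation does to a single CSV row. It is not Prodigy's actual code: Prodigy relies on the sentence boundaries from the loaded spaCy pipeline, while this uses a naive regex as a stand-in. The function name `split_into_tasks`, the `source_row` field, and the sample comment are all made up for illustration.

```python
import re


def split_into_tasks(comment, row_id):
    """Naive stand-in for sentence segmentation: one task per sentence."""
    # Split on whitespace that follows ".", "!" or "?".
    # A real segmenter is much smarter than this.
    pieces = re.split(r"(?<=[.!?])\s+", comment)
    sentences = [p.strip() for p in pieces if p.strip()]
    # "source_row" is a hypothetical field that remembers which CSV row
    # each piece came from.
    return [{"text": s, "source_row": row_id} for s in sentences]


# Hypothetical comment, not taken from the real data.
comment = "I took 20 mg this morning. It helped a lot! Should I keep going?"
tasks = split_into_tasks(comment, row_id=0)

for task in tasks:
    print(task)
```

Each dictionary here corresponds to one screen in the annotation UI. With segmentation turned off, the whole comment would stay together as a single task.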
If there is any natural structure in the documents (e.g., in legal documents they have subparts to paragraphs with (a), (2), or (iii)), you could correct the existing sentence segmenter and retrain it so it works better next time. The main benefit is that you'll likely have more efficient annotation sessions.
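As a rough idea of what structure-aware splitting could look like, the sketch below cuts a text at enumerators such as (a), (2), or (iii). This is only a plain-Python illustration under my own assumptions, not a spaCy component and not a retrained segmenter; the pattern `ENUMERATOR`, the function `split_on_enumerators`, and the sample clause are hypothetical.

```python
import re

# Hypothetical enumerator pattern: "(a)", "(2)", "(iii)" and similar.
ENUMERATOR = re.compile(r"\((?:[a-z]|\d+|[ivxlc]+)\)")


def split_on_enumerators(text):
    """Split text so that every enumerator starts a new segment."""
    starts = [m.start() for m in ENUMERATOR.finditer(text)]
    # Keep any text that comes before the first enumerator as its own segment.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    segments = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in segments if s]


# Hypothetical clause, made up for illustration.
clause = (
    "The tenant shall (a) pay rent monthly; "
    "(b) keep the premises clean; and (iii) give notice."
)

for segment in split_on_enumerators(clause):
    print(segment)
```

The same boundary logic could then inform how you correct the segmenter, so that annotation screens line up with the subparts of the document.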
Let me know if this helps!
Thanks - looks like --unsegmented did the trick, and your explanation was helpful too.