I am working with the ner.correct recipe and it shows unexpected behaviour. Instead of showing the complete text, it only shows part of it, sometimes just one or two words.
Let me show with an example what I mean:
Expected: This is just an example about what I expected to happen.
What actually happened:
First annotation: This is just
Second annotation: an example about what I expected
Hi! By default, ner.correct will use the spaCy model to segment the text into sentences. You can disable this by setting the --unsegmented flag. Just make sure that the text you feed in is reasonably segmented.
Is this the actual text you're using? If so, that's definitely unexpected sentence segmentation behaviour.
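If you go with --unsegmented and want to control the units yourself, one option is to pre-split long documents before loading them. Here's a minimal sketch, assuming you want one task per non-empty line and want to carry the "meta" over to each task. The helper name `paragraph_tasks` and the sample record (including the "source" meta value) are made up for illustration and not part of Prodigy:

```python
import json
from typing import List


def paragraph_tasks(record: dict) -> List[dict]:
    """Split one record into one task per non-empty line of its text."""
    tasks = []
    for para in record["text"].split("\n"):
        para = para.strip()
        if not para:
            # Skip the empty strings produced by blank lines ("\n\n").
            continue
        task = {"text": para}
        if "meta" in record:
            # Copy the meta so each task keeps the original document info.
            task["meta"] = dict(record["meta"])
        tasks.append(task)
    return tasks


# Hypothetical record, shortened from the kind of document discussed here.
record = {
    "text": "Cocinero/Chef\n\nEmpresa de servicios precisa de cocinero.\nSalario según experiencia y formación",
    "meta": {"source": "example"},
}

# One JSON object per line, i.e. JSONL content ready to be written to a file.
lines = [json.dumps(task, ensure_ascii=False) for task in paragraph_tasks(record)]
```

Whether line-level units are the right granularity depends on your data, so treat the split rule as something to adapt.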
The unexpected behaviour arises in the last step. Let me share an actual example:
This is the full document stored as jsonl:
{"text": "Cocinero/Chef\n\nEmpresa de servicios precisa de cocinero con experiencia en colectividades para incorporación inmediata en nuestro centro ubicado en Vigo. Perfil ideal es el de una persona con formación profesional grado superior en hostelería y turismo, Grado Superior de Dirección de Cocina, Grado en ciencias grastronómicas o similar, con capacidad de liderazgo, bien organizada, capaz de manejar las situaciones estresantes, siendo meticuloso en en sus tareas y manteniendo el control de las manos.\nImprescindible dominar la gestión y dirección de una cocina.\nSalario según experiencia y formación",
"meta": {"xx": "xx",
"xx": "xx",
"xx": "xx",
"xx": "xx"}
}
(I put xx to hide some information)
And this is what Prodigy shows when ner.correct is called:
As you can see, instead of showing the full text, it only takes a seemingly random sentence...?
Yes, that definitely looks like it's the sentence segmentation. By default, ner.correct will split the text into sentences. Since you're excluding the dataset ner_positions, Prodigy may be skipping the first sentence, since an example with that input hash is already in the dataset.
If you set --unsegmented when you call ner.correct, segmentation will be disabled and you'll see the full example.
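To illustrate why the first sentence can go missing, here's a rough sketch of the idea: the text is split into sentences, and any sentence whose input hash is already known gets filtered out. This is not Prodigy's actual implementation. The regex splitter is a crude stand-in for spaCy's sentence segmentation, `hashlib` is a stand-in for the real hashing, and the names `input_hash`, `naive_sentences` and `pending_tasks` are hypothetical:

```python
import hashlib
import re
from typing import List, Set


def input_hash(text: str) -> str:
    # Stand-in for an input hash; NOT the hashing function Prodigy uses.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def naive_sentences(text: str) -> List[str]:
    # Crude splitter: break after sentence-final punctuation or on newlines.
    # spaCy's segmentation is model-based and will produce different splits.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [part.strip() for part in parts if part.strip()]


def pending_tasks(text: str, seen_hashes: Set[str]) -> List[str]:
    # Keep only the sentences whose hash has not been seen before.
    return [s for s in naive_sentences(text) if input_hash(s) not in seen_hashes]


text = "Cocinero/Chef\n\nEmpresa de servicios precisa de cocinero. Imprescindible dominar la gestión."

# Pretend the first sentence was already annotated in an excluded dataset.
seen = {input_hash("Cocinero/Chef")}

remaining = pending_tasks(text, seen)
```

With the first sentence's hash already in `seen`, only the later sentences are left to show, which matches what you're seeing: the text appears to start somewhere in the middle.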