I have a collection of texts, some of which are quite long (20-30 sentences). I am using Prodigy to annotate and create a new NER model (a new entity type). I noticed that some of the long texts are displayed as very short snippets, and in some cases the entities of interest are not shown.
Should I preprocess my long texts before I feed them into Prodigy? If so, what is the best way to prepare them? For example, should I split long texts into several chunks of 3-5 sentences, or similar?
This can happen if the sentence boundary detection (which is based on the dependency parse) isn’t 100% accurate – for example, if your sentences are non-standard or different from general news and web text. By default, Prodigy will split the text into sentences using spaCy’s doc.sents. You can turn this behaviour off by setting the --unsegmented flag on the recipe.
20-30 sentences per text is obviously very long, so you probably want to use your own logic to segment the text into smaller chunks. You definitely want to be working on smaller units wherever possible. This not only makes the process faster, because there is less to read, but it can also improve performance, since Prodigy won’t have to compute all possible parses for a huge text.
How you split up your text depends on the structure – but you can still use spaCy’s sentence segmentation features to do this more efficiently, then export the result as JSONL and load it in (or do the whole thing in a custom recipe, whichever you prefer).
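To make the idea concrete, here’s a minimal sketch of that workflow: split each text into sentences, group them into chunks of a few sentences each, and write one `{"text": ...}` record per chunk as JSONL for Prodigy to load. For illustration this uses a naive regex splitter so the snippet is self-contained – in practice you’d iterate over the sentences spaCy gives you via `doc.sents` instead. The function names and the chunk size of 4 are just assumptions for the example.

```python
import json
import re

def chunk_text(text, max_sents=4):
    # Naive regex splitter as a stand-in for spaCy's doc.sents --
    # with spaCy you'd use: sents = [s.text for s in nlp(text).sents]
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Group consecutive sentences into chunks of up to max_sents each.
    return [" ".join(sents[i:i + max_sents]) for i in range(0, len(sents), max_sents)]

def to_jsonl(texts, max_sents=4):
    # One {"text": ...} record per chunk, ready for Prodigy's JSONL loader.
    lines = []
    for text in texts:
        for chunk in chunk_text(text, max_sents):
            lines.append(json.dumps({"text": chunk}))
    return "\n".join(lines)
```

You’d then write the result to a file and pass it to the recipe as the input source. Keeping chunk boundaries on sentence boundaries matters here: entities never cross sentences, so no annotation spans are cut in half.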
Hi Ines, thanks for the suggestion – I was trying to do the same thing, but in a more awkward way. Great spaCy functionality!
Nice to hear! Btw, in case others come across this thread later: for more advanced pre-processing (whitespace, mojibake etc.), you might also want to check out textacy:
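For anyone unfamiliar with the mojibake problem mentioned above: it’s what you get when UTF-8 bytes are mistakenly decoded with the wrong encoding, e.g. "café" turning into "cafÃ©". Libraries like textacy (and ftfy, which it builds on) handle this robustly; the toy sketch below just illustrates the most common case – a single wrong Latin-1 decode – and is not a substitute for those libraries.

```python
def fix_double_encoded(text):
    # Classic mojibake: UTF-8 bytes were decoded as Latin-1,
    # e.g. "café" -> "cafÃ©". Reverse the wrong decode step.
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or not recoverable this way); leave unchanged.
        return text
```

Running this over your corpus before segmentation keeps garbage characters out of the annotation UI and out of the trained model’s vocabulary.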