Best way to prepare a long text for annotations

Andrey · August 28, 2018, 11:06am

I have a collection of texts, some of which are quite long (20-30 sentences). I am using Prodigy to annotate them and create a new NER model (a new entity type). I noticed that some of the long texts are displayed as very short snippets, and in some cases the entities of interest are not shown.

Should I preprocess my long texts before I feed them into Prodigy? If yes, what is the best way to prepare them? For example, split long texts into several chunks of 3-5 sentences or similar?

Thanks.

ines · August 28, 2018, 11:27am

This can happen if the sentence boundary detection (which is based on the dependency parse) isn't 100% accurate – for example, if your sentences are non-standard or different from general news and web text. By default, Prodigy will split the text into sentences using doc.sents. You can turn this behaviour off by setting the --unsegmented flag.

20-30 sentences per text is obviously very long, so you probably want to use your own logic to segment the text into smaller chunks. You definitely want to be working on smaller units wherever possible. This not only makes annotation faster, because you have to read less, but can also improve performance, since Prodigy won't have to compute all possible parses for a huge text.

How you split up your text depends on the structure – but you can still use spaCy's sentence segmentation features to do this more efficiently, then export the result as JSONL and load it in (or do the whole thing in a custom recipe, whichever you prefer).
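As a rough sketch of that workflow (not an official recipe): the helpers below group sentences into chunks of a few sentences each and serialize them as JSONL with a "text" key. The function names, the chunk size of 3, and the `build_nlp` setup (a blank English pipeline with spaCy v3's rule-based `sentencizer`) are all assumptions for illustration; a trained pipeline whose parser sets `doc.sents` should work the same way.

```python
import json
from typing import Iterable, Iterator, List


def chunk_sentences(sentences: List[str], size: int = 3) -> Iterator[str]:
    """Yield chunks of up to `size` consecutive sentences, joined by spaces."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(sentences), size):
        yield " ".join(sentences[start:start + size])


def spacy_sentences(text: str, nlp) -> List[str]:
    """Split one text into sentence strings via the pipeline's doc.sents."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]


def to_jsonl(chunks: Iterable[str]) -> str:
    """Serialize chunks as JSONL: one {"text": ...} object per line."""
    return "\n".join(json.dumps({"text": chunk}) for chunk in chunks)


def build_nlp():
    """Hypothetical setup, assuming spaCy v3: blank English + sentencizer."""
    import spacy  # imported here so the helpers above work without spaCy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, no model download
    return nlp
```

To use it, something like `to_jsonl(chunk_sentences(spacy_sentences(text, build_nlp()), size=3))` gives you a string you can write to a .jsonl file and load in as the source. If your texts have their own structure (paragraphs, headings), splitting on that first and only then on sentences may give more natural chunks.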

Andrey · August 28, 2018, 1:43pm

Hi Ines, thanks for the suggestion. I was trying to do the same thing, but in a more awkward way. Great spaCy functionality!

ines · August 29, 2018, 10:11am

Nice to hear! Btw, also in case others come across this thread later: for more advanced pre-processing (whitespace, mojibake etc.), you might also want to check out textacy.

Andrey · August 29, 2018, 10:12am

Brilliant, thanks Ines!
