Unsplitting annotated sentences

rory-hurley-gds · June 10, 2022, 3:17pm

Hello,

My team labelled 3,000 sentences for one entity using ner.teach and saved to a database.

We then labelled the same stream of 3,000 sentences independently using ner.correct, with some of the labels from en_core_web_lg used to pre-highlight entities.

We intended to merge this dataset with our first dataset. However, we did not use the --unsegmented flag, so our second stream was split into sentences and is therefore much longer.

Is there any way to 'unsplit' the sentences in the second dataset, so that it resembles the structure of the first (whilst maintaining the spans) and can be merged with it?

Many thanks for any help

koaning · June 23, 2022, 9:56am

I don't know if this is possible. If you had passed a meta field with a unique document ID, you might be able to reconstruct the documents afterwards with a "groupby" kind of operation that aggregates the sentences back to the document level. But I fear this isn't the case.
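
For illustration only, here is a rough sketch of what such a "groupby" could look like in plain Python. It assumes things the thread does not confirm: that every exported example carries a hypothetical meta["doc_id"] key, that the sentences of a document appear in their original order, and that sentences can be rejoined with a single space. The function name unsplit and its parameters are made up for this example. The shifted span offsets are valid for the rebuilt text, which may differ in whitespace from the original input text, and token-level indices on the spans are not carried over.

```python
from collections import defaultdict


def unsplit(examples, id_key="doc_id", sep=" "):
    """Merge sentence-level examples into one example per document.

    Assumes each example has example["meta"][id_key] (hypothetical key)
    and that sentences arrive in document order.
    """
    # Group the sentence-level examples by their document id.
    grouped = defaultdict(list)
    for eg in examples:
        grouped[eg["meta"][id_key]].append(eg)

    merged = []
    for doc_id, sentences in grouped.items():
        text = ""
        spans = []
        for i, eg in enumerate(sentences):
            if i > 0:
                text += sep
            # Character position where this sentence starts in the rebuilt text.
            offset = len(text)
            for span in eg.get("spans", []):
                spans.append(
                    {
                        "start": span["start"] + offset,
                        "end": span["end"] + offset,
                        "label": span["label"],
                    }
                )
            text += eg["text"]
        merged.append({"text": text, "spans": spans, "meta": {id_key: doc_id}})
    return merged
```

If the rebuilt text does not match the first dataset's text character for character, the two datasets still won't line up, so it would be worth comparing the texts before merging anything.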
