Unsplitting annotated sentences


My team labelled 3,000 sentences for one entity using ner.teach and saved to a database.

We then labelled the same stream of 3,000 sentences independently, using ner.correct with some of the labels from en_core_web_lg to pre-highlight entities.

We intended to merge this dataset with our first one. However, we did not use the --unsegmented flag, so our second stream was split into sentences and is therefore much longer.

Is there any way to 'unsplit' the sentences in the second dataset, so that it resembles the structure of the first (whilst maintaining the spans) and can be merged with it?

Many thanks for any help

I don't know if this is possible. If you passed a meta field with a unique document id then you might be able to reconstruct it afterwards by doing a "groupby" kind of operation to aggregate it back to the document level. But I fear this isn't the case.
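
If such an id does exist, a rough sketch of that "groupby" idea might look like the following. It works on plain dictionaries (for example, examples exported from the dataset) and rests on several assumptions: the meta key is called "doc_id" (a hypothetical name), the sentences of each document are adjacent and in their original order, and joining them with a single space is acceptable even though it may not reproduce the original whitespace exactly. The function name unsplit is also just illustrative.

```python
from itertools import groupby


def unsplit(examples, id_key="doc_id", sep=" "):
    """Merge sentence-level examples back into one example per document.

    `id_key` is a hypothetical meta field; use whatever id you stored.
    """
    merged = []
    # groupby only groups consecutive items, so this relies on the sentences
    # of each document being adjacent and in their original order.
    for doc_id, group in groupby(examples, key=lambda eg: eg["meta"][id_key]):
        text = ""
        spans = []
        for eg in group:
            if text:
                text += sep
            # Character position at which this sentence starts in the merged text.
            offset = len(text)
            for span in eg.get("spans", []):
                # Copy the span, dropping token indices because they would no
                # longer be valid after merging (assumption: only character
                # offsets are needed downstream).
                new_span = {
                    k: v for k, v in span.items()
                    if k not in ("token_start", "token_end")
                }
                new_span["start"] = span["start"] + offset
                new_span["end"] = span["end"] + offset
                spans.append(new_span)
            text += eg["text"]
        merged.append({"text": text, "spans": spans, "meta": {id_key: doc_id}})
    return merged
```

The merged text will only match the first dataset if the separator matches the original spacing, so it is worth comparing a few merged texts against the originals before combining the datasets.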