Sentencize already annotated data

Hey, I'm using Prodigy to annotate some data for an NER component and then converting it to spaCy format via the data-to-spacy command so I can train with spaCy. I've realised the samples in my annotated data are too long for my BERT model, so I need to split them up into sentences. Is there an easy way to do this after the data has already been annotated?

Hi! One solution would be to take the originally annotated JSON data, split it into sentences using spaCy and then adjust the offsets. This is actually very straightforward: for each sentence, you just need to subtract sent.start_char from all character offsets and sent.start from all token indices (this applies to both the tokens and the spans).
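To make the offset adjustment concrete, here's a minimal sketch in plain Python. It assumes each example is a dict with a "text" and a list of "spans" carrying character offsets in "start" and "end", and that you already have the sentence boundaries as (start_char, end_char) pairs, e.g. from doc.sents. The function name split_example is made up for illustration and isn't part of Prodigy or spaCy. Spans that cross a sentence boundary are simply dropped here, and token-level fields aren't handled.

```python
def split_example(example, sentence_bounds):
    """Split one annotated example into one example per sentence.

    sentence_bounds is a list of (start_char, end_char) pairs,
    e.g. [(sent.start_char, sent.end_char) for sent in doc.sents].
    """
    text = example["text"]
    result = []
    for sent_start, sent_end in sentence_bounds:
        spans = []
        for span in example.get("spans", []):
            # Keep only spans that lie fully inside this sentence.
            if span["start"] >= sent_start and span["end"] <= sent_end:
                shifted = dict(span)
                # Make the offsets relative to the start of the sentence.
                shifted["start"] = span["start"] - sent_start
                shifted["end"] = span["end"] - sent_start
                spans.append(shifted)
        result.append({"text": text[sent_start:sent_end], "spans": spans})
    return result
```

If your spans also carry token indices, those would need the same treatment, just with the index of the sentence's first token subtracted instead of its character offset.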

Prodigy includes a handy helper function split_sentences that you can use for this. You can then save the data to a new dataset, and export it again using data-to-spacy.

import spacy
from prodigy.components.db import connect
from prodigy.components.preprocess import split_sentences

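nlp = spacy.load("en_core_web_sm")  # or whatever you want to use for sentencizing
db = connect()  # connect to the database configured in your prodigy.json
examples = db.get_dataset("your_dataset")  # load the annotated examples
# split_sentences yields one example per sentence with adjusted offsets
split_examples = list(split_sentences(nlp, examples))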

Alternatively, you could also work directly with the .spacy files you exported, which are just serialized collections of Doc objects (see https://spacy.io/api/docbin#from_disk). You can process the texts again using a pipeline that can do sentence splitting and then set the doc.ents from your annotations. For each sentence in doc.sents, which is a Span, you can call its as_doc method to convert it to a standalone Doc object (see https://spacy.io/api/span#as_doc). You can then add these to a new DocBin and save it out as a .spacy file. (If you're using spaCy Projects, you can also make this more elegant by adding this script as a preprocessing step in your workflow so it runs automatically before training.)
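Here's a rough sketch of what the DocBin approach could look like. It assumes a blank English pipeline with the rule-based sentencizer, so swap in whatever pipeline you actually use for sentence splitting. The function name split_docbin_into_sentences and the file paths in the comments are placeholders. Entities that no longer line up with token boundaries after re-processing are skipped.

```python
import spacy
from spacy.tokens import DocBin


def split_docbin_into_sentences(doc_bin, nlp):
    """Return a new DocBin with one Doc per sentence, keeping the entities."""
    new_bin = DocBin()
    for doc in doc_bin.get_docs(nlp.vocab):
        # Process the text again so the new doc has sentence boundaries.
        sent_doc = nlp(doc.text)
        ents = []
        for ent in doc.ents:
            # Map each annotated entity onto the new doc by character offsets.
            span = sent_doc.char_span(ent.start_char, ent.end_char, label=ent.label_)
            if span is not None:  # None means the offsets don't match tokens
                ents.append(span)
        sent_doc.ents = ents
        for sent in sent_doc.sents:
            # as_doc() turns the sentence Span into a standalone Doc.
            new_bin.add(sent.as_doc())
    return new_bin


# Assumption: a blank pipeline with the rule-based sentencizer is good enough.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Example usage (placeholder paths):
# doc_bin = DocBin().from_disk("./train.spacy")
# split_docbin_into_sentences(doc_bin, nlp).to_disk("./train_sents.spacy")
```

One thing to watch out for: if an entity spans a sentence boundary predicted by the pipeline, it'll end up cut in half, so it's worth checking a few of the resulting docs before training.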


Thanks, first way works great.
