Hey, I'm using Prodigy to annotate some data for an NER component and then converting it to spaCy's format via the data-to-spacy command so I can train with spaCy. I've realised the samples in my annotated data are too long for my BERT model, so I need to split them up into sentences. Is there an easy way to do this after already annotating the data?
Hi! One solution would be to take the originally annotated JSON data, split it into sentences using spaCy and then adjust the offsets. This is actually very straightforward: you just need to subtract the sentence's start character offset (sent.start_char) from all character offsets of the tokens and spans.
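To make the offset arithmetic concrete, here's a minimal sketch, assuming Prodigy-style task dicts with "text" and "spans" keyed by character offsets. It drops the "tokens" and token indices for brevity; those would need the same treatment, shifting by sent.start instead.

import spacy

nlp = spacy.load("en_core_web_sm")

def split_example(eg):
    # Split one annotated example into one task per sentence (illustrative)
    doc = nlp(eg["text"])
    for sent in doc.sents:
        spans = [
            {"start": span["start"] - sent.start_char,
             "end": span["end"] - sent.start_char,
             "label": span["label"]}
            for span in eg.get("spans", [])
            # keep only spans that fall fully inside this sentence
            if span["start"] >= sent.start_char and span["end"] <= sent.end_char
        ]
        yield {"text": sent.text, "spans": spans}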
Prodigy includes a handy helper function split_sentences that you can use for this. You can then save the data to a new dataset and export it again using data-to-spacy.
import spacy
from prodigy.components.db import connect
from prodigy.components.preprocess import split_sentences

nlp = spacy.load("en_core_web_sm")  # or whatever you want to use for sentencizing
db = connect()
examples = db.get_dataset("your_dataset")
# yields one example per sentence, with token and span offsets adjusted
split_examples = list(split_sentences(nlp, examples))
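Saving the result to a new dataset could then look like this (the dataset name is just a placeholder):

db.add_dataset("your_dataset_sentences")
db.add_examples(split_examples, datasets=["your_dataset_sentences"])

After that, you can export the sentence-level data with something like prodigy data-to-spacy ./corpus --ner your_dataset_sentences.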
Alternatively, you could also work with the .spacy files you exported directly, which are just serialized collections of Doc objects: https://spacy.io/api/docbin#from_disk. You can then process the texts again using a model that can do sentence splitting and add the doc.ents from your annotations. Each sentence in doc.sents is a Span, so you can call its as_doc method to convert it to a standalone Doc object: https://spacy.io/api/span#as_doc. You can then add these to a new DocBin and save it out as a .spacy file. (If you're using spaCy Projects, you can also make this more elegant by adding this script as a preprocessing step in your workflow so it runs automatically before training.)
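Here's a minimal sketch of that second approach, assuming the exported file is ./train.spacy and using en_core_web_sm for sentence splitting (both are placeholders):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")  # any pipeline that sets sentence boundaries
doc_bin = DocBin().from_disk("./train.spacy")
sent_doc_bin = DocBin()
for doc in doc_bin.get_docs(nlp.vocab):
    # re-process the raw text so the new Doc has sentence boundaries
    new_doc = nlp(doc.text)
    # carry over the annotated entities, aligned by character offsets
    ents = [new_doc.char_span(ent.start_char, ent.end_char, label=ent.label_) for ent in doc.ents]
    new_doc.ents = [ent for ent in ents if ent is not None]  # char_span returns None if offsets don't align
    for sent in new_doc.sents:
        sent_doc_bin.add(sent.as_doc())
sent_doc_bin.to_disk("./train_sentences.spacy")

One caveat: an entity that crosses a sentence boundary won't survive the as_doc conversion, so those annotations are dropped.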
Thanks, the first way works great.