Hey, I'm using Prodigy to annotate some data for an NER component and then converting it to spaCy's format via the data-to-spacy command so I can train with spaCy. I've realised the samples in my annotated data are too long for my BERT model, so I need to split them up into sentences. Is there an easy way to do this after already annotating the data?
Hi! One solution would be to take the originally annotated JSON data, split it into sentences using spaCy and then adjust the offsets. This is actually very straightforward: you just need to subtract the sentence's start character offset (sent.start_char) from all character offsets of the tokens and spans.
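To make the offset arithmetic concrete, here's a minimal sketch, assuming Prodigy-style task dicts with "text" and "spans" keyed by character offsets. It drops the "tokens" and token indices for brevity; those would need the same treatment, shifting by sent.start instead.

import spacy

nlp = spacy.load("en_core_web_sm")

def split_example(eg):
    # Split one annotated example into one task per sentence (illustrative)
    doc = nlp(eg["text"])
    for sent in doc.sents:
        spans = [
            {"start": span["start"] - sent.start_char,
             "end": span["end"] - sent.start_char,
             "label": span["label"]}
            for span in eg.get("spans", [])
            # keep only spans that fall fully inside this sentence
            if span["start"] >= sent.start_char and span["end"] <= sent.end_char
        ]
        yield {"text": sent.text, "spans": spans}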
Prodigy includes a handy helper function split_sentences that you can use for this. You can then save the data to a new dataset and export it again using data-to-spacy.
import spacy
from prodigy.components.db import connect
from prodigy.components.preprocess import split_sentences

nlp = spacy.load("en_core_web_sm")  # or whatever you want to use for sentencizing
db = connect()
examples = db.get_dataset("your_dataset")
# yields one example per sentence, with token and span offsets adjusted
split_examples = list(split_sentences(nlp, examples))
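Saving the result to a new dataset could then look like this (the dataset name is just a placeholder):

db.add_dataset("your_dataset_sentences")
db.add_examples(split_examples, datasets=["your_dataset_sentences"])

After that, you can export the sentence-level data with something like prodigy data-to-spacy ./corpus --ner your_dataset_sentences.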
Alternatively, you could also work with the .spacy files you exported directly, which are just serialized collections of Doc objects: https://spacy.io/api/docbin#from_disk. You can then process the texts again using a model that can do sentence splitting and add the doc.ents from your annotations. Each sentence in doc.sents is a Span, so you can call its as_doc method to convert it to a standalone Doc object: https://spacy.io/api/span#as_doc. You can then add these to a new DocBin and save it out as a .spacy file. (If you're using spaCy Projects, you can also make this more elegant by adding this script as a preprocessing step in your workflow so it runs automatically before training.)
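Here's a minimal sketch of that second approach, assuming the exported file is ./train.spacy and using en_core_web_sm for sentence splitting (both are placeholders):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")  # any pipeline that sets sentence boundaries
doc_bin = DocBin().from_disk("./train.spacy")
sent_doc_bin = DocBin()
for doc in doc_bin.get_docs(nlp.vocab):
    # re-process the raw text so the new Doc has sentence boundaries
    new_doc = nlp(doc.text)
    # carry over the annotated entities, aligned by character offsets
    ents = [new_doc.char_span(ent.start_char, ent.end_char, label=ent.label_) for ent in doc.ents]
    new_doc.ents = [ent for ent in ents if ent is not None]  # char_span returns None if offsets don't align
    for sent in new_doc.sents:
        sent_doc_bin.add(sent.as_doc())
sent_doc_bin.to_disk("./train_sentences.spacy")

One caveat: an entity that crosses a sentence boundary won't survive the as_doc conversion, so those annotations are dropped.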
Thanks, the first way works great.