Since the NER model only sees the already split sentences and treats them as individual examples, it also has no way of knowing the original offsets (and you don’t want to run it over the un-split text either, because this will have a negative effect on the beam search).
However, if you're only using `ner.teach`, you also don't have to worry about pre-defined spans. So you can write your own very simple sentence-splitting pre-processor that keeps a copy of the original text and the start/end offsets of each sentence. For example:
```python
import copy

from prodigy import set_hashes

def split_sentences_with_offsets(nlp, stream, batch_size=32):
    tuples = ((eg['text'], eg) for eg in stream)
    for doc, eg in nlp.pipe(tuples, as_tuples=True, batch_size=batch_size):
        for sent in doc.sents:
            # work on a copy for each sentence, so later sentences
            # still see the unmodified original example
            sent_eg = copy.deepcopy(eg)
            sent_eg['original'] = {'text': eg['text'],
                                   'sent_start': sent.start_char,
                                   'sent_end': sent.end_char}
            sent_eg['text'] = sent.text
            sent_eg = set_hashes(sent_eg)
            yield sent_eg
```
Your annotation tasks will then contain an `"original"` property that lets you relate the span start and end positions back to the original document text. If you want token indices instead, you can simply use `sent.start` and `sent.end`. In that case, you might also want to add information about the `nlp` object, like the spaCy version and model you're using, so that you can always reproduce the tokenization.
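For example, here's a minimal sketch of how you could relate an annotated span back to the document after export, and record the model details on the task. The helper names and extra keys here are just my suggestions, not a built-in Prodigy API:

```python
import spacy

def remap_span_to_original(eg, span):
    # spans produced on the split sentence are relative to eg['text'],
    # so shifting them by the sentence's start offset puts them back
    # into the coordinates of the original, un-split document
    offset = eg['original']['sent_start']
    return {'start': span['start'] + offset,
            'end': span['end'] + offset,
            'label': span['label']}

def add_model_info(eg, nlp):
    # record enough about the nlp object to reproduce the tokenization
    eg['original']['spacy_version'] = spacy.__version__
    eg['original']['model'] = nlp.meta.get('name')
    return eg
```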
If you don't want to modify the `ner.teach` recipe, you can also just wrap your loader and splitting logic in a simple script, and pipe its output forward to `ner.teach` (if the `source` argument is not set, it defaults to `sys.stdin`). I describe a similar approach in my comment on this thread.
```bash
python preprocess_data.py | prodigy ner.teach your_dataset en_core_web_sm
```
Your script can then load the data however you want, preprocess it, and print the dumped JSON of each individual annotation task:
```python
# preprocess_data.py
import json
import spacy
from prodigy.components.loaders import JSONL

nlp = spacy.load('en_core_web_sm')          # or any other model
stream = JSONL('/path/to/your/data.jsonl')  # or any other loader
stream = split_sentences_with_offsets(nlp, stream)  # your function above

for eg in stream:
    print(json.dumps(eg))  # print the dumped JSON
```
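Each line the script prints is a single JSON task, looking roughly like this (the values are illustrative, and `set_hashes` will also have added Prodigy's `_input_hash` and `_task_hash` keys):

```json
{"text": "First sentence of the doc.", "original": {"text": "First sentence of the doc. Second sentence.", "sent_start": 0, "sent_end": 26}}
```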
If you want a more elegant solution, you can also wrap your loader in a function decorated with `@prodigy.recipe` and use the CLI helpers to pass in arguments via the command line. If a recipe doesn't return a dictionary of components, Prodigy won't start the server and will just execute the function. (Of course, you can also use `plac` or a similar library instead – this depends on your personal preferences.)
```bash
prodigy preprocess en_core_web_sm data.jsonl -F recipe.py | prodigy ner.teach your_dataset en_core_web_sm
```
```python
import prodigy

@prodigy.recipe('preprocess')
def preprocess(model, source):
    nlp = spacy.load(model)  # plus the imports and helper from above
    for eg in split_sentences_with_offsets(nlp, JSONL(source)):
        print(json.dumps(eg))
```
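The recipe decorator will expose the function's arguments (`model` and `source`) as positional arguments on the command line, which is what makes the `prodigy preprocess en_core_web_sm data.jsonl -F recipe.py` invocation above work without any extra plumbing.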