Can I recover offsets into the document from ner.teach?

ines · March 9, 2018, 5:10pm

Since the NER model only sees the already split sentences and treats them as individual examples, it also has no way of knowing the original offsets (and you don’t want to run it over the un-split text either, because this will have a negative effect on the beam search).

However, if you’re only using ner.teach, you also don’t have to worry about pre-defined spans. So you can write your own very simple sentence splitting pre-processor that keeps a copy of the original text and the start/end offsets of the sentence. For example:

import copy
from prodigy import set_hashes

def split_sentences_with_offsets(nlp, stream, batch_size=32):
    tuples = ((eg['text'], eg) for eg in stream)
    for doc, eg in nlp.pipe(tuples, as_tuples=True, batch_size=batch_size):
        for sent in doc.sents:
            # copy the untouched example for every sentence, so that
            # eg['text'] is always still the full document text
            sent_eg = copy.deepcopy(eg)
            sent_eg['original'] = {'text': eg['text'],
                                   'sent_start': sent.start_char,
                                   'sent_end': sent.end_char}
            sent_eg['text'] = sent.text
            sent_eg = set_hashes(sent_eg)
            yield sent_eg

Your annotation tasks will then contain an "original" property that lets you relate the span start and end positions back to the original document text: the spans are relative to the sentence, so adding sent_start to a span’s start and end gives its character offsets in the document. If you want token indices instead, you can simply use sent.start and sent.end. In that case, you might also want to add information about the nlp object, like the spaCy version and model you’re using, so that you can always reproduce the tokenization.
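To make the offset arithmetic concrete, here is a minimal sketch of mapping annotated spans back to the document. It assumes the tasks carry the "original" property in the shape produced by the splitter above; the helper name to_document_offsets and the example task are hypothetical and only there for illustration.

```python
def to_document_offsets(task):
    """Map sentence-relative span offsets back to the original document.

    Assumes task["original"] has the shape written by the splitter above:
    {"text": <full document text>, "sent_start": int, "sent_end": int}.
    """
    original = task["original"]
    sent_start = original["sent_start"]
    doc_spans = []
    for span in task.get("spans", []):
        # Span offsets are relative to the sentence, so shift by sent_start.
        start = sent_start + span["start"]
        end = sent_start + span["end"]
        doc_spans.append({
            "start": start,
            "end": end,
            "label": span.get("label"),
            "text": original["text"][start:end],
        })
    return doc_spans


# Hypothetical annotated task for the second sentence of a two-sentence text.
doc_text = "I like tea. Apple is based in Cupertino."
sentence = "Apple is based in Cupertino."
sent_start = doc_text.index(sentence)
task = {
    "text": sentence,
    "original": {"text": doc_text,
                 "sent_start": sent_start,
                 "sent_end": sent_start + len(sentence)},
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
    "answer": "accept",
}
print(to_document_offsets(task))
```

Slicing the original text with the shifted offsets is a cheap sanity check: the slice should match the text that was highlighted in the sentence.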

If you don’t want to modify the ner.teach recipe, you can also just wrap your loader and splitting logic in a simple script, and pipe its output forward to ner.teach (if the source argument is not set, it defaults to sys.stdin). I describe a similar approach in my comment on this thread.

python preprocess_data.py | prodigy ner.teach your_dataset en_core_web_sm

Your script can then load the data however you want, preprocess it and print the dumped JSON of each individual annotation task:

# preprocess_data.py
import spacy
import json
from prodigy.components.loaders import JSONL

nlp = spacy.load('en_core_web_sm')  # or any other model
stream = JSONL('/path/to/your/data.jsonl')  # or any other loader
stream = split_sentences_with_offsets(nlp, stream)  # your function above (define or import it in this script)

for eg in stream:
    print(json.dumps(eg))  # print the dumped JSON

If you want a more elegant solution, you can also wrap your loader in a @prodigy.recipe and use the CLI helpers to pass in arguments via the command line. If a recipe doesn’t return a dictionary of components, Prodigy won’t start the server and will just execute the function. (Of course, you can also use plac or a similar library instead; this depends on your personal preferences.)

prodigy preprocess en_core_web_sm data.jsonl -F recipe.py | prodigy ner.teach your_dataset en_core_web_sm

import prodigy

@prodigy.recipe('preprocess')
def preprocess(model, source):
    # the above code here: load the model and source, split, print the JSON
    ...
