Can I recover offsets into the document from ner.teach?

I’m running ner.teach in its standard configuration. (No custom recipe.) I give it documents, it segments those documents into sentences, and solicits annotations of spans within the sentences. Prodigy stores this span information as character offsets relative to the sentence. Is there any way I can instead get character offsets (or better yet, token offsets) relative to the entire document?

(I think the answer is “no” out of the box, but maybe there’s a way I can write a custom recipe? Maybe slip sentence offset information into the meta field.)

Offsets relative to the sentence are sufficient for training data, but I’m experimenting with ways in which Prodigy’s opinionated active learning workflow can be used as a kind of guided search, even when there isn’t enough data to train a model.

(I suppose I could create a corpus where each document was a sentence and add meta information saying where the sentence began in the document. But is there some way this has already been done for me?)
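
(For example, each record in such a corpus might look roughly like this – the field names here are just illustrative:

{"text": "Lorem ipsum dolor sit amet.", "meta": {"doc_id": "document-1", "sent_start_char": 53}}

)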

Since the NER model only sees the already split sentences and treats them as individual examples, it also has no way of knowing the original offsets (and you don’t want to run it over the un-split text either, because this will have a negative effect on the beam search).

However, if you’re only using ner.teach, you also don’t have to worry about pre-defined spans. So you can write your own very simple sentence splitting pre-processor that keeps a copy of the original text and the start/end offsets of the sentence. For example:

import copy

from prodigy import set_hashes

def split_sentences_with_offsets(nlp, stream, batch_size=32):
    tuples = ((eg['text'], eg) for eg in stream)
    for doc, eg in nlp.pipe(tuples, as_tuples=True, batch_size=batch_size):
        for sent in doc.sents:
            # copy the original example so each sentence task keeps a
            # reference to the full, unmodified document text
            sent_eg = copy.deepcopy(eg)
            sent_eg['original'] = {'text': eg['text'],
                                   'sent_start': sent.start_char,
                                   'sent_end': sent.end_char}
            sent_eg['text'] = sent.text
            sent_eg = set_hashes(sent_eg)
            yield sent_eg

Your annotation tasks will then contain an "original" property that lets you relate the span start and end positions back to the original document text. If you want token indices instead, you can simply use sent.start and sent.end. In that case, you might also want to add information about the nlp object, like the spaCy version and model you’re using, so that you can always reproduce the tokenization.
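
For example, here's a minimal sketch of how you could shift the annotated spans back to document-level character offsets once you've collected the annotations – assuming the tasks were created by the pre-processor above (the helper name is just an example):

def to_document_offsets(eg):
    # shift each sentence-relative span by the sentence's character
    # offset within the original document text
    sent_start = eg['original']['sent_start']
    return [(span['start'] + sent_start, span['end'] + sent_start, span['label'])
            for span in eg.get('spans', [])]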

If you don’t want to modify the ner.teach recipe, you can also just wrap your loader and splitting logic in a simple script, and pipe its output forward to ner.teach (if the source argument is not set, it defaults to sys.stdin). I describe a similar approach in my comment on this thread.

python preprocess_data.py | prodigy ner.teach your_dataset en_core_web_sm

Your script can then load the data however you want, preprocess it and print the dumped JSON of each individual annotation task:

# preprocess_data.py
import spacy
import json
from prodigy.components.loaders import JSONL

nlp = spacy.load('en_core_web_sm')  # or any other model
stream = JSONL('/path/to/your/data.jsonl')  # or any other loader
stream = split_sentences_with_offsets(nlp, stream)  # your function above

for eg in stream:
    print(json.dumps(eg))  # print the dumped JSON

If you want a more elegant solution, you can also wrap your loader in a @prodigy.recipe and use the CLI helpers to pass in arguments via the command line. If a recipe doesn’t return a dictionary of components, Prodigy won’t start the server and will just execute the function. (Of course, you can also use plac or a similar library instead – this depends on your personal preference.)

@prodigy.recipe('preprocess')
def preprocess(model, source):
    # the above code here

prodigy preprocess en_core_web_sm data.jsonl -F recipe.py | prodigy ner.teach your_dataset en_core_web_sm
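
In case it's useful, here's a minimal sketch of what the full recipe file could look like – this assumes the split_sentences_with_offsets function from above lives in the same recipe.py, and all names are just examples:

# recipe.py
import json
import spacy
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('preprocess')
def preprocess(model, source):
    # load the model and data, split into per-sentence tasks and print
    # each task as JSON so it can be piped forward to ner.teach
    nlp = spacy.load(model)
    stream = JSONL(source)
    for eg in split_sentences_with_offsets(nlp, stream):
        print(json.dumps(eg))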

Thanks. I’ll try this.

Just now I tried a preprocessing strategy in which I divided the documents up into sentences, wrote the sentence start character into the meta, then used a custom ner.teach recipe that did not call stream = split_sentences(model.orig_nlp, stream). I thought that this would give me what I want, but only the first annotation contained the meta field of the original document. All the others just contained pattern and score values. Is this expected behavior?

(Then I did the same thing except I put the extra information in a section called document instead of meta and that was passed along. So I guess I should treat meta as a volatile part of the example that might get overwritten by Prodigy.)

Ah damn, thanks! I actually tracked this down to a bug in the PatternMatcher that would overwrite the annotation task's meta, instead of checking if it exists and adding to it. Just fixed that – in general, Prodigy shouldn't overwrite the meta and only add to it.

Ok this is weird. As described above, I created a corpus with the documents segmented the way I wanted them. In each document’s JSON record, I created an _id value and a document key with a start_char value saying where the text started relative to the beginning of the document.

{
    "_id": "document-1",
    "text": "Lorem ipsum etc. etc...",
    "document": {"start_char": 53}
}

I did a quick ner.teach session in which I labeled 62 of these documents (using a custom recipe so they weren’t further divided into single sentences). When I dumped the annotations from the database, 53 of them contained the original document information with _id and document in addition to the fields that Prodigy adds. 9 of them did not: they dropped the _id and document fields, though they still had the Prodigy annotation information.

I can’t see a reason why these 9 should be treated differently. Is this a bug?

Thanks for investigating and sharing your findings!

This might actually be related to an issue similar to the one I described above. If you're using ner.teach with patterns, Prodigy will essentially use two models: the EntityRecognizer and the PatternMatcher. On a per-batch basis, the results of both models are combined using the interleave function, so you'll always be annotating a mix of both (pattern matches and model suggestions).

So there might be some case in one of the models where Prodigy reconstructs the task or doesn't deepcopy the existing example cleanly, causing custom properties to be dropped. If this only happens in one specific branch, you might end up with some examples that have this problem, but not others.

Is there anything that stands out about those 9 examples? For example, do they have a pattern ID assigned in the meta, whereas the others don't? Or vice versa? Alternatively, is it possible that those 9 examples have multiple "spans" assigned, and the others only have one? (There's one branch in the NER model that looks potentially suspicious – so if the length of the spans is the differentiator, I think I know what the issue is and how to fix it.)
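
If it helps, a quick script along these lines could surface the differentiator – this assumes you've exported the dataset to a JSONL file via db-out, and the file path and key names are just placeholders:

import json

with open('annotations.jsonl', encoding='utf8') as f:
    examples = [json.loads(line) for line in f]

for eg in examples:
    kept_custom_fields = '_id' in eg and 'document' in eg
    print(kept_custom_fields,
          eg.get('meta', {}).get('pattern'),  # pattern ID in the meta, if any
          len(eg.get('spans', [])))           # number of spans assigned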

Edit: Just released v1.4.0, which includes several fixes that ensure tasks are not reconstructed and custom properties are not overwritten. Could you try again with the new version and see if it all works as expected now?

Everything works in version 1.4.0. All the information in the original document records is passed along.
