rel.manual not accepting entities because of tokenization


I'm trying to relate some pre-annotated entities. However, when I feed a dataset with them to rel.manual:
prodigy rel.manual test_dataset_rel en_core_web_sm dataset:test_dataset -l SUBJECT
it throws several warnings like this:
⚠ Skipped 27 span(s) that were already present in the input data because the tokenization didn't match.

And indeed, it drops most of the labels, especially (but not only) the ones that are several tokens long, e.g.:

I wonder what might be the cause.

Hi! In general, the idea here is that annotations should always map to valid token boundaries because you'll be making predictions over tokens later, and any mismatch means that the model can't easily be updated with the annotation.

However, it looks like in this case, the token boundaries are fine, so I wonder if this is related to whitespace? Maybe double-check the underlying JSON and check that the entity span doesn't include the trailing space? It could be as simple as an off-by-one character index on the last span.
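One way to run that check over the exported data is a small script like the one below. This is only a rough sketch: it assumes each example has the usual "text", "tokens" and "spans" keys with character offsets in "start" and "end", and `find_misaligned_spans` is a made-up helper name, not part of Prodigy.

```python
def find_misaligned_spans(example):
    """Return (span, reason) pairs for spans that do not sit on token boundaries."""
    token_starts = {tok["start"] for tok in example["tokens"]}
    token_ends = {tok["end"] for tok in example["tokens"]}
    problems = []
    for span in example.get("spans", []):
        covered = example["text"][span["start"]:span["end"]]
        if span["start"] not in token_starts:
            problems.append((span, "start is not a token start"))
        elif span["end"] not in token_ends:
            problems.append((span, "end is not a token end"))
        elif covered != covered.strip():
            problems.append((span, "span covers leading/trailing whitespace"))
    return problems


# Hand-written example, not taken from the attached dataset
example = {
    "text": "Russia borders Norway",
    "tokens": [
        {"text": "Russia", "start": 0, "end": 6, "id": 0},
        {"text": "borders", "start": 7, "end": 14, "id": 1},
        {"text": "Norway", "start": 15, "end": 21, "id": 2},
    ],
    "spans": [
        {"start": 0, "end": 6, "label": "SUBJECT"},    # aligned
        {"start": 15, "end": 20, "label": "SUBJECT"},  # off by one at the end
    ],
}

for span, reason in find_misaligned_spans(example):
    print(span["start"], span["end"], reason)
```

Any span it reports is a candidate for the "tokenization didn't match" warning, although the recipe's own check may be stricter than this one.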

Thank you for the reply. I checked, and none of the annotated spans had trailing whitespace or any other characters that shouldn't be there. There shouldn't be any discrepancies in the texts, because I'm running rel.manual on the database containing the very same texts used for NER + entity annotations.

The only thing I noticed in the exported JSON is that there are annotated spans that don't have the text field, though the character/token boundaries seem correct.
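Since the offsets are there, the missing text can be recovered by slicing the example text, which makes it easy to eyeball whether the boundaries really are correct. A minimal sketch, assuming the same "text" / "spans" layout as in the export (`span_texts` is just a name I picked):

```python
def span_texts(example):
    """Return (label, covered text) for every span, sliced from the example text."""
    text = example["text"]
    return [
        (span.get("label"), text[span["start"]:span["end"]])
        for span in example.get("spans", [])
    ]


# Hand-written example with two occurrences of the same entity
example = {
    "text": "Russia said Russia agreed",
    "spans": [
        {"start": 0, "end": 6, "label": "SUBJECT"},
        {"start": 12, "end": 18, "label": "SUBJECT"},
    ],
}

for label, covered in span_texts(example):
    print(label, repr(covered))
```

Printing with `repr` makes stray spaces at either end of a span visible.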

E.g. how I see it in Prodigy:

What I see in the output:

At the same time, I see that this is most likely not the cause, because here the second occurrence of "Russia" is ignored in rel.manual, and in the exported file there is no "text" field in either:

Maybe there's something with the way I load the recipe?

I'm adding a file with a few annotations, perhaps this can help: test_dataset.jsonl (22.2 KB)

Update: it appears there was something with the data. I tried on a different dataset and it worked well.

Sorry for the false alarm!

I ran into the same issue as the first user in this thread, which was some version of

⚠ Skipped 27 span(s) that were already present in the input data because the tokenization didn't match.

The two things that triggered it for me (~2800 samples) were leading whitespace in the text and some cases where the would-be first token wasn't in the list of tokens. All of the examples in the data had been used without issue in the Prodigy ner and/or spans interfaces.

To fix the former case, I removed the leading whitespace and then decreased each start and end value in the span and tokens by 1. In the latter case, I created a doc for the part of the string that wasn't in the tokens, and added the token attributes from the new doc to the list of tokens.

import spacy
import srsly

nlp = spacy.blank("en")

# path to the exported JSONL file (placeholder, point it at your own export)
path = "annotations.jsonl"
# load the jsonl and unpack the generator into a list
prodigy_output = list(srsly.read_jsonl(path))

# iterate through output
for x in prodigy_output:
    # check first token in each example
    for tok in x['tokens'][:1]:
        # if problem is leading whitespace in string
        if tok['start'] == 1 and x['text'][0:1] == ' ':
            # remove leading whitespace
            x['text'] = x['text'][1:]
            # subtract 1 from each start and end in spans
            # (the dicts are updated in place, so nothing is reassigned)
            for e in x.get('spans', []):
                e['start'] -= 1
                e['end'] -= 1
            # subtract 1 from each start and end in tokens
            for e in x['tokens']:
                e['start'] -= 1
                e['end'] -= 1
        # if problem is first token is not in tokens
        elif tok['start'] >= 1:
            # tokenize the part of the text that is missing
            doc = nlp(x['text'][:tok['start']])
            # format the missing tokens; 'ws' records whether whitespace follows
            first_tokens = [{'text': token.text, 'start': token.idx,
                             'end': token.idx + len(token.text), 'id': token.i,
                             'ws': bool(token.whitespace_)} for token in doc]
            # rewrite tokens with the missing tokens included
            # (ids of the existing tokens are not renumbered here)
            x['tokens'] = first_tokens + x['tokens']
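After a repair like this, it seems worth checking that the tokens line up with the text again before loading the data back in. The sketch below is my own sanity check, not something Prodigy provides; `check_tokens` is a hypothetical helper and it only assumes the "text" and "tokens" keys used above.

```python
def check_tokens(example):
    """Return a list of human-readable problems with an example's tokens."""
    text = example["text"]
    errors = []
    if text != text.lstrip():
        errors.append("text has leading whitespace")
    cursor = 0  # end offset of the previous token
    for tok in example["tokens"]:
        gap = text[cursor:tok["start"]]
        if gap.strip():
            # non-whitespace characters that no token covers
            errors.append(f"untokenized text before offset {tok['start']}: {gap!r}")
        if text[tok["start"]:tok["end"]] != tok["text"]:
            errors.append(f"token {tok['text']!r} does not match its offsets")
        cursor = tok["end"]
    if text[cursor:].strip():
        errors.append("untokenized text after the last token")
    return errors


# Hand-written example where the first token is missing from the list
broken = {
    "text": "Moscow is cold",
    "tokens": [
        {"text": "is", "start": 7, "end": 9, "id": 1},
        {"text": "cold", "start": 10, "end": 14, "id": 2},
    ],
}
repaired = {
    "text": "Moscow is cold",
    "tokens": [{"text": "Moscow", "start": 0, "end": 6, "id": 0}] + broken["tokens"],
}

print(check_tokens(broken))
print(check_tokens(repaired))
```

An empty list means the tokens cover the text with nothing but whitespace between them, which matches the two failure cases described above.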