ner.manual task with add_tokens and skip=True fails with KeyError.

Hello, I have a ner.manual task with a text that is 12556 characters long and a pre-loaded set of spans. The spans were created outside of Prodigy, so the idea of the recipe is simply to confirm that the spans are correct and, if they’re not, fix the incorrect ones. I’m calling the add_tokens function with skip=True, but Prodigy still fails with KeyError: 9968. I’m not sure how to proceed – any ideas would be appreciated. I should also mention that I’m using Prodigy 1.5.1.
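For reference, the relevant part of my recipe looks roughly like this – the model name and file path are placeholders, but the add_tokens call is exactly what I’m running:

import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

nlp = spacy.load('en_core_web_sm')           # placeholder – I load my own model here
stream = JSONL('spans_to_review.jsonl')      # placeholder path; each record has 'text' and 'spans'
stream = add_tokens(nlp, stream, skip=True)  # this is the call that raises KeyError: 9968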

Thanks

Sorry about that – I think this might be related to a bug in the preprocessor that caused the skip setting not to be respected in some cases. We’ve already fixed this and will ship a new release that includes the fix soon.

In the meantime, you could try something like this to add the functionality in your code:

def skip_mismatched_tokens(stream):
    '''Skip examples where the tokenisation doesn't align to the spans.'''
    for eg in stream:
        # use .get so examples without a 'spans' key don't raise a KeyError
        if all_spans_match_tokens(eg.get('spans', []), eg['tokens']):
            yield eg

def all_spans_match_tokens(spans, tokens):
    '''Check whether all spans align to token boundaries.'''
    if not spans:
        return True
    starts = set(token['start'] for token in tokens)
    ends = set(token['end'] for token in tokens)
    for span in spans:
        if 'token_start' not in span and span['start'] not in starts:
            return False
        if 'token_end' not in span and span['end'] not in ends:
            return False
    return True
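To see what the filter does, here's a quick hand-built test with made-up data: the first example's span lands exactly on token boundaries, while the second one's span ends in the middle of a token, so only the first survives:

examples = [
    {'text': 'New York', 'spans': [{'start': 0, 'end': 8, 'label': 'GPE'}],
     'tokens': [{'text': 'New', 'start': 0, 'end': 3, 'id': 0},
                {'text': 'York', 'start': 4, 'end': 8, 'id': 1}]},
    {'text': 'NewYork', 'spans': [{'start': 0, 'end': 3, 'label': 'GPE'}],
     'tokens': [{'text': 'NewYork', 'start': 0, 'end': 7, 'id': 0}]},
]
filtered = list(skip_mismatched_tokens(examples))
assert len(filtered) == 1  # only the aligned example is kept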

Hi Ines, thanks for your response. Your code above assumes that I have both the spans and the tokens, but in my case, I only have the spans. I use the add_tokens function to add the tokens, but that function is the one that throws the KeyError. So I can’t really use your logic – or am I missing something?

Thanks

Ah, sorry – yeah, I mostly focused on the logic that matches the spans up with the tokens, since that's the trickiest part. To add the tokens, you could do something like this:

import spacy

nlp = spacy.load('en_core_web_sm')  # use the same model as the rest of your pipeline

for eg in examples:
    doc = nlp.make_doc(eg['text'])
    eg['tokens'] = [{'text': token.text, 'start': token.idx,
                     'end': token.idx + len(token.text), 'id': i}
                    for i, token in enumerate(doc)]

Basically, you're using the nlp object of a loaded spaCy model to tokenize the text, and then writing out one dict per token in the format Prodigy expects.
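To make the format concrete: for a text like 'Hello world', the comprehension above produces a 'tokens' value like this:

[{'text': 'Hello', 'start': 0, 'end': 5, 'id': 0},
 {'text': 'world', 'start': 6, 'end': 11, 'id': 1}]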

The trickier part is then matching up the existing spans with the tokens, to ensure that spaCy's tokenization will actually produce tokens for the given entities. NER works on a per-token basis, so if your tokenization doesn't match, your model might perform much worse, because it's learned from tokens that it will never actually produce "in real life".
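Here's a made-up example of what such a mismatch looks like – the text and offsets are invented, but the mechanism is presumably the same one behind your KeyError: 9968, where 9968 would be a character offset that doesn't line up with any token boundary:

import spacy

nlp = spacy.blank('en')  # tokenizer only – use your own model's tokenization in practice
doc = nlp.make_doc('Flight AB1234 to Berlin')
print([(token.text, token.idx) for token in doc])
# [('Flight', 0), ('AB1234', 7), ('to', 14), ('Berlin', 17)]
# A span over just the digits, {'start': 9, 'end': 13}, starts mid-token:
# no token begins at character 9, so it can't be mapped onto the tokens.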

That works, thanks!