Hello, I have a ner.manual task with the text being 12556 characters long and a pre-loaded set of spans. The spans were created outside of Prodigy, so the idea of the recipe is simply to confirm that the spans are correct and, if not, fix the ones that are incorrect. I’m calling the add_tokens function with skip=True, but Prodigy still fails with KeyError: 9968. I’m not really sure how to proceed. Any ideas would be appreciated. Also, I should mention that I’m using Prodigy 1.5.1.
Sorry about that – I think this might be related to a bug in the preprocessor, which caused the skip setting to not be respected correctly in some cases. We’ve already fixed this and will ship a new release soon that includes this fix.
In the meantime, you could try something like this to add the functionality in your code:
def skip_mismatched_tokens(stream):
    '''Skip examples where the tokenisation doesn't align to the spans.'''
    for eg in stream:
        if all_spans_match_tokens(eg['spans'], eg['tokens']):
            yield eg
def all_spans_match_tokens(spans, tokens):
    '''Check whether any spans don't align to tokens.'''
    if not spans:
        return True
    starts = set(token['start'] for token in tokens)
    ends = set(token['end'] for token in tokens)
    for span in spans:
        if 'token_start' not in span and span['start'] not in starts:
            return False
        if 'token_end' not in span and span['end'] not in ends:
            return False
    return True
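For instance, here's a quick sanity check with a made-up example (the span boundaries are chosen to line up exactly with the token "world"), so the filter keeps it; shift the span's end off a token boundary and it would get dropped instead:

example = {
    'text': 'Hello world',
    'tokens': [
        {'text': 'Hello', 'start': 0, 'end': 5, 'id': 0},
        {'text': 'world', 'start': 6, 'end': 11, 'id': 1},
    ],
    # Span (6, 11) matches the token boundaries of 'world' exactly
    'spans': [{'start': 6, 'end': 11, 'label': 'WORD'}],
}
assert len(list(skip_mismatched_tokens([example]))) == 1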
Hi Ines, thanks for your response. Your code above assumes that I have both the spans and the tokens, but in my case, I only have the spans. I use the add_tokens function to add the tokens, but that function is the one that throws the KeyError exception. So I can’t really use your logic. Or am I missing something?
Ah, sorry – yeah, I mostly tried to focus on the logic that matches the spans up with the tokens, since that's the trickiest part. To add the tokens, you could do something like this:
import spacy

nlp = spacy.load('en_core_web_sm')  # or whichever model you train with

for eg in examples:
    doc = nlp.make_doc(eg['text'])
    eg['tokens'] = [{'text': token.text, 'start': token.idx,
                     'end': token.idx + len(token.text), 'id': i}
                    for i, token in enumerate(doc)]
Basically, you're using the nlp object with a loaded spaCy model to tokenize the text, then writing each token out as a dict with the expected values.
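Putting the two pieces together, one way to wrap this up as a single preprocessing step (just a sketch using the all_spans_match_tokens helper from above, not a drop-in replacement for add_tokens) could look like this:

def add_tokens_and_filter(nlp, stream):
    # Tokenize each example with spaCy, then drop examples whose
    # pre-loaded spans don't line up with the token boundaries.
    for eg in stream:
        doc = nlp.make_doc(eg['text'])
        eg['tokens'] = [{'text': token.text, 'start': token.idx,
                         'end': token.idx + len(token.text), 'id': i}
                        for i, token in enumerate(doc)]
        if all_spans_match_tokens(eg.get('spans', []), eg['tokens']):
            yield eg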
The trickier part is then matching up the existing spans with the tokens, to ensure that spaCy's tokenization will actually produce tokens for the given entities. NER works on a per-token basis, so if your tokenization doesn't match, your model might perform much worse, because it's learned from tokens that it will never actually produce "in real life".
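To make that concrete, here's a made-up example of a mismatch: the annotated span covers only part of a token the tokenizer produces, so its end offset never shows up as a token boundary and the helper above rejects it:

# Hypothetical mismatch: the span covers only 'ABC' inside the single
# token 'ABC123', so its end offset (3) is not a token boundary.
eg = {'text': 'ABC123 is a code', 'spans': [{'start': 0, 'end': 3, 'label': 'PREFIX'}]}
doc = nlp.make_doc(eg['text'])
tokens = [{'text': t.text, 'start': t.idx, 'end': t.idx + len(t.text), 'id': i}
          for i, t in enumerate(doc)]
print(all_spans_match_tokens(eg['spans'], tokens))  # False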