I’m trying to prevent sentences from being split in the ner.teach recipe:
def ner_teach_wrapper(dataset, spacy_model, language, label=None, unsegmented=True):
When I set unsegmented=True, it throws the following error. Everything works perfectly fine when I leave the unsegmented option at its default setting.
tasks = controller.get_questions()
  File "cython_src/prodigy/core.pyx", line 87, in prodigy.core.Controller.get_questions
  File "cython_src/prodigy/core.pyx", line 71, in iter_tasks
  File "cython_src/prodigy/components/sorters.pyx", line 136, in __iter__
  File "cython_src/prodigy/components/sorters.pyx", line 51, in <genexpr>
  File "cython_src/prodigy/models/ner.pyx", line 260, in __call__
  File "cython_src/prodigy/models/ner.pyx", line 228, in get_tasks
  File "cytoolz/itertoolz.pyx", line 1046, in cytoolz.itertoolz.partition_all.__next__ (cytoolz/itertoolz.c:14538)
  File "cython_src/prodigy/models/ner.pyx", line 206, in predict_spans
I can't find the cause of this. Any idea?
Thanks for the report! This is strange… for some reason, the hashes don't seem to get added correctly to the stream, even though the loader should take care of this. The split_sentences preprocessor (which is used if you do want to segment the text) rehashes the stream again after segmenting, so I guess that's why the problem doesn't occur there. It's still mysterious, though, because I don't understand how the hashes would get lost…
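For intuition, the rehashing idea can be sketched with plain hashlib. This is only an illustration, not Prodigy's actual set_hashes implementation, though the `_input_hash` and `_task_hash` key names do match the ones Prodigy attaches: the input hash depends only on the raw input text, while the task hash also covers the suggested annotations, so the same sentence with different candidate spans yields distinct tasks.

```python
import hashlib
import json

def rehash(stream):
    """Yield examples with fresh _input_hash/_task_hash keys (illustrative only)."""
    for eg in stream:
        # Input hash: based on the raw input text only.
        text = eg.get("text", "")
        eg["_input_hash"] = int.from_bytes(
            hashlib.md5(text.encode("utf8")).digest()[:4], "big"
        )
        # Task hash: based on the whole task, i.e. text plus suggested spans.
        task_key = json.dumps(
            {"text": text, "spans": eg.get("spans", [])}, sort_keys=True
        )
        eg["_task_hash"] = int.from_bytes(
            hashlib.md5(task_key.encode("utf8")).digest()[:4], "big"
        )
        yield eg

first, second = list(rehash([
    {"text": "Hello world"},
    {"text": "Hello world", "spans": [{"start": 0, "end": 5}]},
]))
# Same input text -> same input hash; different spans -> different task hash.
assert first["_input_hash"] == second["_input_hash"]
assert first["_task_hash"] != second["_task_hash"]
```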
I’ll investigate this – pretty sure we can still get a fix in for the upcoming release!
In the meantime, you can try loading and hashing your stream manually before you pass it into ner.teach. If you're calling the ner.teach recipe function directly from your wrapper, you can also pass in an already loaded stream as the source argument (instead of a string). Here's an example of the loading and hashing:
from prodigy.components.loaders import JSONL  # or however you want to load it
from prodigy.util import set_hashes

stream = JSONL(your_source)  # generator of example dicts with a "text" key
stream = (set_hashes(eg) for eg in stream)  # lazily add input and task hashes
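To see the whole pattern end to end without Prodigy installed, here's a self-contained sketch using only the standard library. The `add_hashes` helper is a hypothetical stand-in for prodigy.util.set_hashes, and the in-memory StringIO stands in for your JSONL file; the point is just that every example carries both hash keys before the stream would ever reach ner.teach:

```python
import hashlib
import io
import json

def add_hashes(eg):
    """Hypothetical stand-in for prodigy.util.set_hashes (illustrative only)."""
    eg["_input_hash"] = int.from_bytes(
        hashlib.md5(eg.get("text", "").encode("utf8")).digest()[:4], "big"
    )
    eg["_task_hash"] = int.from_bytes(
        hashlib.md5(json.dumps(eg, sort_keys=True).encode("utf8")).digest()[:4], "big"
    )
    return eg

# Pretend this is your JSONL source file on disk.
source = io.StringIO('{"text": "Apple is a company"}\n{"text": "I like apples"}\n')

# Lazily parse and hash, mirroring the JSONL loader + set_hashes combination.
stream = (json.loads(line) for line in source if line.strip())
stream = (add_hashes(eg) for eg in stream)

# Every example now carries both hashes before the model ever sees it.
examples = list(stream)
assert all("_input_hash" in eg and "_task_hash" in eg for eg in examples)
```

The generators keep the pipeline lazy, so even a large JSONL file is hashed one example at a time rather than loaded into memory up front.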
Thanks for the prompt workaround. It works now. Looking forward to the fix in the next release.
We've just released v1.5.0, which should fix this problem. All streams that pass through the built-in recipes are now hashed before they are processed by the model.