Error running textcat.batch-train if text is empty string

Hi, I’m seeing an error running textcat.batch-train:

$ prodigy textcat.batch-train my_dataset en_core_web_sm -l INTERESTING --eval-split 0.2 

Loaded model en_core_web_sm
Using 20% of examples (7571) for evaluation
Using 100% of remaining examples (30284) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  

#          LOSS       F-SCORE    ACCURACY  
Traceback (most recent call last):                                                                                                                                                                                                         
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/prodigy/__main__.py", line 235, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 130, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/prodigy/recipes/textcat.py", line 131, in batch_train
    loss += model.update(batch, revise=False, drop=dropout)
  File "cython_src/prodigy/models/textcat.pyx", line 200, in prodigy.models.textcat.TextClassifier.update
  File "cython_src/prodigy/models/textcat.pyx", line 214, in prodigy.models.textcat.TextClassifier._update
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/spacy/language.py", line 513, in pipe
    for doc in docs:
  File "pipeline.pyx", line 704, in pipe
  File "pipeline.pyx", line 709, in spacy.pipeline.TextCategorizer.predict
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/api.py", line 55, in predict
    X = layer(X)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/attention.py", line 25, in begin_update
    attention, bp_attention = self._get_attention(self.Q, Xs, lengths)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/attention.py", line 47, in _get_attention
    self.ops.softmax(attention[start : start+length], inplace=True)
  File "ops.pyx", line 190, in thinc.neural.ops.Ops.softmax
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2252, in amax
    out=out, **kwargs)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/numpy/core/_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

It’s probably something related to the annotations in my dataset, since I’ve successfully trained other models on this machine, but I’m currently stuck trying to decipher the error message – any leads would be appreciated. My dataset has one label (INTERESTING) with 218 accepted examples and thousands of rejected answers. I thought the problem might be that the accepted answers were too sparse, so I tried it with a smaller set of rejected examples, but I still got the same error.

I noticed this thread where you mention you don’t have to pass in the label anymore – makes sense. I’ve removed the --label argument and still get the error, so this is probably a different problem.

Based on this thread, I’ve also tried using the en_vectors_web_lg model with the same results.

Thanks for the report!

It looks like a zero-length batch is being passed through somewhere. There was a similar issue, although it failed in a different place (on a mean operation). I think there’s a problem in the sorting that sometimes forwards empty batches, and this isn’t caught.

Could you edit /Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/model.py and put a try/except block around the call to self.predict in __call__, so you can print out the input array? Something like this:

# thinc/neural/_classes/model.py
def __call__(self, x):
    try:
        return self.predict(x)
    except ValueError:
        print(x)  # show the faulty input before re-raising
        raise

That should give us the faulty value being passed out from spaCy’s TextCategorizer.predict, which is as far back in the call chain as we can get before we’re into the Cython modules.

Thanks for the reply!

On a preliminary run, I’m getting the following for x:

[, <.. some copy from my training set..>, break]

This is followed by the stacktrace:

Traceback (most recent call last):                                                                                                                                                                                                         
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/prodigy/__main__.py", line 235, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 130, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/prodigy/recipes/textcat.py", line 131, in batch_train
    loss += model.update(batch, revise=False, drop=dropout)
  File "cython_src/prodigy/models/textcat.pyx", line 200, in prodigy.models.textcat.TextClassifier.update
  File "cython_src/prodigy/models/textcat.pyx", line 214, in prodigy.models.textcat.TextClassifier._update
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/spacy/language.py", line 513, in pipe
    for doc in docs:
  File "pipeline.pyx", line 704, in pipe
  File "pipeline.pyx", line 709, in spacy.pipeline.TextCategorizer.predict
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 162, in __call__
    return self.predict(x)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/api.py", line 55, in predict
    X = layer(X)
  File "/Users/wei/anaconda/envs/py3/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 165, in __call__
    raise
RuntimeError: No active exception to reraise

I think I figured it out – I had a line in my import file where the text was empty: {"text": "", "answer": "reject", "label": "INTERESTING"}. When I took that line out and recreated my dataset, the model finished training. A nice feature might be to check for empty annotations and prevent them from being added to the dataset.
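In case it helps anyone else, the clean-up I did can be sketched like this (a minimal sketch – file handling is omitted and the sample records are made up):

```python
import json

def drop_empty_text(lines):
    """Yield only JSONL records whose 'text' value is a non-empty string."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("text", "").strip():
            yield record

# two records, one with an empty text – the empty one is dropped
lines = [
    '{"text": "a real example", "answer": "accept", "label": "INTERESTING"}',
    '{"text": "", "answer": "reject", "label": "INTERESTING"}',
]
cleaned = list(drop_empty_text(lines))
```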

Thanks for updating with your solution – I thought we were checking for this already, but it looks like we don’t. So this is definitely a bug and will be fixed in the next release!

The best solution would probably be to filter out empty strings in the loaders, which are used by both the recipes and db-in.

Update: The issue will be fixed in the upcoming version of Prodigy. As a solution, we’ve added a stream filter utility that lets you remove examples if the value of a specified example key is missing or an empty string.

In the ner and textcat recipes, this functionality is enabled for the "text" key by default. Future image recipes will do the same for the "image" key. And of course, you can also import and use the filter in your own custom recipes for any other key(s) you might have.
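To give an idea of the shape of such a filter, here’s a hedged sketch – the function name filter_empty and its signature are assumptions for illustration, not Prodigy’s actual API:

```python
def filter_empty(stream, key="text"):
    """Skip examples whose value for `key` is missing or an empty string."""
    for eg in stream:
        value = eg.get(key)  # hypothetical key to check, e.g. "text" or "image"
        if isinstance(value, str) and value.strip():
            yield eg

# examples with a missing or empty "text" are silently skipped
stream = [{"text": "hello"}, {"text": ""}, {"image": "photo.jpg"}]
filtered = list(filter_empty(stream))
```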

I’m getting a KeyError: 'text'. Here is the error log:

prodigy ner.teach [dataset] 'en_core_web_sm' carecinch.jsonl 
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prodigy/__main__.py", line 238, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 143, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/util.pyx", line 173, in prodigy.util.suggest_view_id
  File "/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "cython_src/prodigy/components/sorters.pyx", line 127, in __iter__
  File "cython_src/prodigy/components/sorters.pyx", line 53, in genexpr
  File "cython_src/prodigy/models/ner.pyx", line 215, in __call__
  File "cython_src/prodigy/models/ner.pyx", line 202, in get_tasks
  File "cython_src/prodigy/models/ner.pyx", line 178, in prodigy.models.ner.EntityRecognizer.__call__.get_tasks.sort_by_entity
KeyError: 'text'

There are some lines in the data where spans is empty, i.e. {"text": "Patient is a white male", "spans": []} – but every line does have a text and a spans key.

Thanks – it seems like there are two separate errors here: one when the spans attribute is empty, and another when an entry in spans has no text key. Both issues will be fixed in the next release.

Just to confirm: have you added spans yourself in the data, or is it only Prodigy that’s edited this spans attribute?

Yes, I’m adding the spans myself, using this code (thanks to @ines):

spans = []
for m_id, start, end in matches:  # matches come from spaCy's Matcher
    entity = doc[start:end]  # get a slice of the document
    spans.append({'start': entity.start_char, 'end': entity.end_char,
                  'label': nlp.vocab.strings[m_id]})

Ah, cool, this is easy to adjust then – so for now, you can simply add an additional property to the span dictionary, i.e. 'text': entity.text. And then make sure to only add the spans to your example if they’re not empty.
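Concretely, the adjusted loop would look something like this – a sketch where the matcher output is faked with plain tuples so the snippet stands alone; in your script, matches, doc and nlp come from your existing spaCy code:

```python
text = "Patient was seen in Silicon Valley"

# stand-ins for character offsets derived from spaCy's matcher output:
# (label, start_char, end_char)
matches = [("LOCATION", 20, 34)]

spans = []
for label, start_char, end_char in matches:
    spans.append({
        "start": start_char,
        "end": end_char,
        "text": text[start_char:end_char],  # the extra key that was missing
        "label": label,
    })

# only build an example if there is at least one span
example = {"text": text, "spans": spans} if spans else None
```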

That didn’t work – now I’m getting a KeyError: 'score'. It seems I have to provide all the attributes of Prodigy’s NER format.

Since you’re populating the spans manually using spaCy – are you sure you want to use the ner.teach recipe and not just the mark recipe, which will go through your examples and let you annotate them in order?

I think the problem here is that the ner.teach recipe expects you to input raw text, and will use spaCy to recognise the entities it finds, assign a score and then improve the model as you annotate. If there are already spans defined in the input, this leads to all kinds of problems later on, because those spans were not created by the model and there’s no logical way for Prodigy to handle them.

Prodigy should probably raise better errors in this case to help the user understand what’s going on. Alternatively, the NER model could also ignore or remove pre-defined spans completely – but this might be bad default behaviour, and may lead to more confusion.

Ok got the point.

It seems the mark command is not working. I tried prodigy mark - 'sentences.txt' --label LOCATION, but it just iterates over the text and doesn’t do anything else. Am I doing it wrong?

mark will just iterate over whatever you give it, in that exact order, and ask you for accept/reject feedback. This is the best solution if you already have pre-annotated examples – e.g. texts with spans created using spaCy and something like the script you used above. If you want Prodigy to load a spaCy model, detect the entities for you and select the most important ones, you should use ner.teach.

Example 1 – Load in your plain-text sentences, use the en_core_web_sm model to find entities in them, and only show the ones that have the label LOCATION:

prodigy ner.teach my_dataset en_core_web_sm sentences.txt --label LOCATION

Example 2 – Load in pre-annotated examples in JSONL format and annotate them in NER mode, in the exact order they come in and without making any predictions or assumptions. The data could look like this: {"text": "Silicon Valley", "spans": [{"start": 0, "end": 14, "label": "LOCATION"}]}

prodigy mark my_dataset sentences_with_spans.jsonl --view-id ner
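If you’re creating those pre-annotated examples yourself, writing them out in JSONL format is just one json.dumps call per line – a minimal sketch (the file name is a placeholder):

```python
import json

examples = [
    {"text": "Silicon Valley",
     "spans": [{"start": 0, "end": 14, "label": "LOCATION"}]},
]

# one JSON object per line is all the JSONL format requires
jsonl = "\n".join(json.dumps(eg) for eg in examples)

# e.g. write it out for the mark recipe:
# with open("sentences_with_spans.jsonl", "w") as f:
#     f.write(jsonl + "\n")
```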

It seems --label LOCATION has some issues. Here is the output when I try to run it with ner.teach:

prodigy ner.teach my_dataset 'en_core_web_sm' sample_data.txt --label LOCATION
/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/toolz/itertoolz.py:368: RuntimeWarning: Mean of empty slice.
  return next(iter(seq))
/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/toolz/itertoolz.py:368: RuntimeWarning: Degrees of freedom <= 0 for slice
  return next(iter(seq))
/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

And the browser shows “No tasks available”.