Error when using textcat.teach with small number of samples

I’ve been running into a bug in textcat.teach that I think is caused by having only a small number of samples for annotation. I don’t get the warning when I use hundreds of examples, but I do when I have only a few dozen. I can reproduce it with a subset of the large file, so I don’t think it’s a JSONL problem.

The warning doesn’t appear when I start Prodigy, but when I open the page in the browser. The page then says “No tasks available”. I’m using v0.2.1. Let me know if you want the two files to reproduce it.

/Users/ahalterman/anaconda3/lib/python3.6/site-packages/prodigy/app.py:58: RuntimeWarning: Mean of empty slice.
  tasks = controller.get_questions()
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/prodigy/app.py:58: RuntimeWarning: Degrees of freedom <= 0 for slice
  tasks = controller.get_questions()
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

Thanks. Looks like an empty batch is being returned at some point. This seems to be causing a divide-by-zero that produces nan values, which then infect the model parameters.
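To illustrate what I mean, here’s a minimal standalone sketch with plain numpy – not the actual Prodigy code, just the failure mode:

import numpy as np

scores = np.array([])           # an empty batch of scores
batch_mean = scores.mean()      # "Mean of empty slice." warning, result is nan
weights = np.ones(5)
weights -= 0.01 * batch_mean    # nan spreads into every parameter it touches
print(weights)                  # [nan nan nan nan nan]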

This issue should now be fixed in the latest update, Prodigy v0.3.0! :tada:

Hi there!

I just started with Prodigy (thanks a lot!) and I’m getting something similar with Prodigy v0.4.0. It’s reproducible using the news_headlines.jsonl example:

prodigy dataset my_set "A new dataset" --author Me
prodigy textcat.teach my_set en_core_web_sm news_headlines.jsonl --label POLITICS

This works well. But if I delete some lines, at some point it gives me this error:

~/.virtualenvs/spacy2/lib/python3.5/site-packages/prodigy/app.py:58: RuntimeWarning: Mean of empty slice.
  tasks = controller.get_questions()
~/.virtualenvs/spacy2/lib/python3.5/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
~/.virtualenvs/spacy2/lib/python3.5/site-packages/prodigy/app.py:58: RuntimeWarning: Degrees of freedom <= 0 for slice
  tasks = controller.get_questions()
~/.virtualenvs/spacy2/lib/python3.5/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
~/.virtualenvs/spacy2/lib/python3.5/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

I’m not sure it has to do with the number of lines – it seems to be a bit random (could it be due to the random batch selection?). For example, it works with 71 lines; then I remove 2 lines and it doesn’t work anymore, and if I add 2 more lines it doesn’t work either. Then I remove more lines (down to 56) and it works again. It’s a bit puzzling :slight_smile:

Any idea what could be causing that problem?
Thanks!

Could that actually be linked to this issue?

Hm! Thanks.

Unfortunately, because the error’s in one of the compiled modules, it’s tricky to give you a good mitigation. Maybe you could edit the ~/.virtualenvs/spacy2/lib/python3.5/site-packages/prodigy/app.py file and do something stupid like:


try:
    tasks = controller.get_questions()
except:
    continue

Somewhere in the app we’re hitting an empty batch that we haven’t checked for, and that’s ultimately causing the call to numpy.mean to fail. So maybe just streaming past that part of the data will help.

Thanks for the suggestion – I gave it a try. Here are some observations:

  • np.seterr(all='raise') needs to be added for the try/except around get_questions() to work – the messages in the console are actually warnings; I figured it out after realizing the except block was never triggered :p (see the minimal sketch below)
  • the continue instruction can’t apply, since the route is not in a loop. Returning None or {'tasks': None, 'total': controller.total_annotated, 'progress': controller.progress} leads to an actual error message in the interface.
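For reference, here’s the minimal standalone numpy sketch of that first point – nothing Prodigy-specific, just showing why the plain except never fired:

import numpy as np

print(np.mean([]))           # only emits RuntimeWarnings and returns nan

np.seterr(all='raise')       # promote the invalid floating-point division to an exception
try:
    np.mean([])
except FloatingPointError as err:
    print('caught:', err)    # now the failing mean is actually catchable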

It looks like the empty batch comes from the scores of the sentences not being high enough (after reading the issue posted previously). In my case, the goal is to classify the text regardless of its associated score. I’ll look into a custom recipe – that would be the best way to do it, right?

Thanks for the detailed feedback – much appreciated!

If you just want to label the examples as they come in and skip the "active learning component", i.e. the resorting of the stream, you could simply use the mark recipe, which has pretty much the same API as the teach recipes. In the long run, you'll likely want more flexibility, though, so a custom recipe would definitely be the way to go.
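For example, something along these lines should work – double-check prodigy mark --help for the exact arguments in your version:

prodigy mark my_set news_headlines.jsonl --label POLITICS --view-id classification

And if you do go down the custom recipe route later, a minimal sketch could look roughly like this (hypothetical recipe and file names, adapt to your data):

# recipe.py – stream in examples as-is and annotate them with the classification interface
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('textcat.plain')
def textcat_plain(dataset, source, label):
    stream = JSONL(source)  # load the raw examples from a .jsonl file
    stream = ({'text': eg['text'], 'label': label} for eg in stream)
    return {'dataset': dataset, 'stream': stream, 'view_id': 'classification'}

You’d then run it with the -F flag pointing to your recipe file, e.g. prodigy textcat.plain my_set news_headlines.jsonl POLITICS -F recipe.py.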

This is simply awesome! The mark recipe is doing the trick beautifully!!

(Thank you so much, Ines and Matthew – spaCy is a beautiful product and Prodigy is going to be an excellent addition to promote spaCy usage.)


I just got the same error again in v1.1.0. It happened right when the model switched from using the seeds to the nascent model. It could be that none of the documents were relevant after I exhausted the seed list, but I wanted to bring it up anyway.

/Users/ahalterman/anaconda3/lib/python3.6/site-packages/prodigy/app.py:60: RuntimeWarning: Mean of empty slice.
  tasks = controller.get_questions()
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/prodigy/app.py:60: RuntimeWarning: Degrees of freedom <= 0 for slice
  tasks = controller.get_questions()
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/Users/ahalterman/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

@andy Thanks, this is interesting. Will investigate!

Thanks for the detailed report – this is interesting, I haven't seen this one before. It looks like it ends up computing nan somewhere... I'm glad it's still working though, and the results look good!

Could you run one of the commands again and set the environment variable PRODIGY_LOGGING=basic? This will log everything that's going on across the different components, and the last entry in the log should hopefully give us a better idea of where exactly the error occurs.
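For example, on macOS/Linux you can just prefix whatever command you’ve been running (placeholder dataset and file names):

PRODIGY_LOGGING=basic prodigy textcat.teach your_dataset en_core_web_sm your_data.jsonl --label YOUR_LABEL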

Edit: Missed this:

Sorry, the conda-forge builds always take a little longer. Will ping @honnibal and see if there's still anything left to do to get v2.0.5 up. If you're running Prodigy with spaCy v2.0.4, this might be a possible explanation, as v2.0.5 fixes an issue with the vector pickling that caused vectors to be set to None. So you could also try and re-run your experiment in a different environment with Prodigy v1.1.0 and spaCy v2.0.5.
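For example, something like this in a fresh environment should do it (just a sketch – the model name is only an example, adjust to your setup):

python -m pip install spacy==2.0.5
python -c "import spacy; print(spacy.__version__)"   # should print 2.0.5
python -m spacy download en_core_web_sm              # re-download a model against the new version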

Thanks for the logging tip!

I uninstalled and reinstalled Anaconda, created a new environment, and the error remained.

I also got the following while annotating:

22:12:07 - RESPONSE: /give_answers
22:12:04 - GET: /get_questions
/Users/me/anaconda3/lib/python3.6/site-packages/prodigy/app.py:60: RuntimeWarning: Mean of empty slice.
  tasks = controller.get_questions()
/Users/me/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/Users/me/anaconda3/lib/python3.6/site-packages/prodigy/app.py:60: RuntimeWarning: Degrees of freedom <= 0 for slice
  tasks = controller.get_questions()
/Users/me/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/Users/me/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
22:12:04 - CONTROLLER: Returning a batch of tasks from the queue
22:12:04 - RESPONSE: /get_questions (10 examples)

Thanks – this error looks like it might be related to the case reported in this thread. One possible cause could be that Prodigy ends up with an empty batch somewhere because there aren’t enough sentences with high enough scores.

Will definitely investigate this and hopefully fix it for the next version! :+1:

When do you think the next version would be released?

In the new year, as soon as Matt and I are back from our holidays. We want to avoid pushing any updates while we're both on bad family internet connections in different parts of the world :wink:


I’m sorry – I’m just too excited about Prodigy and didn’t think of that.

Happy holidays!

Saving related errors here:

/Users/me/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py:2168: RuntimeWarning: invalid value encountered in sqrt
  ret = sqrt(sqnorm)
/Users/me/anaconda3/lib/python3.6/site-packages/prodigy/app.py:60: RuntimeWarning: Mean of empty slice.
  tasks = controller.get_questions()
/Users/me/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
/Users/me/anaconda3/lib/python3.6/site-packages/prodigy/app.py:60: RuntimeWarning: Degrees of freedom <= 0 for slice
  tasks = controller.get_questions()
/Users/me/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:105: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/Users/me/anaconda3/lib/python3.6/site-packages/numpy/core/_methods.py:127: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)

Hi,

Thanks for the amazing work you have done. :tophat:

When I launch Prodigy, and also when I start training, I get the warning below:

lib/python3.6/site-packages/numpy/linalg/linalg.py:2168: RuntimeWarning: invalid value encountered in sqrt

However, everything runs and there is no crash or anything. Since this is my first attempt, I can’t be sure why this happened or whether something went wrong with the model. I just don’t want my annotation efforts to go to waste :slight_smile:

Initialising with 10 seed terms from seeds/myseeds.txt
Found 762 examples with seeds

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

* * * SHOWS UP AFTER SAVING FROM UI * * *

/Users/me/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py:2168: RuntimeWarning: invalid value encountered in sqrt
  ret = sqrt(sqnorm)
^C
Saved 11 annotations to database SQLite
Dataset: leapdive_posts
Session ID: 2017-12-26_17-50-53

When I start training, the warning shows up at 0%:

Loaded model en_vectors_web_lg
Using 20% of examples (270) for evaluation
Using 100% of remaining examples (1082) for training
Dropout: 0.2  Batch size: 10  Iterations: 10

#          LOSS       F-SCORE    ACCURACY
  0%|
/Users/me/anaconda3/lib/python3.6/site-packages/numpy/linalg/linalg.py:2168: RuntimeWarning: invalid value encountered in sqrt
  ret = sqrt(sqnorm)

01         39.775     0.547      0.784
02         31.491     0.871      0.922
03         26.007     0.924      0.952
04         23.654     0.919      0.948
05         23.963     0.913      0.944
06         23.234     0.913      0.944
07         23.251     0.913      0.944
08         23.249     0.913      0.944
09         22.392     0.912      0.944
10         24.196     0.913      0.944

MODEL      USER       COUNT
accept     accept     79
accept     reject     8
reject     reject     177
reject     accept     5


Correct    256
Incorrect  13


Baseline   0.69
Precision  0.91
Recall     0.94
F-score    0.92
Accuracy   0.95

When I first created the seed terms file, I put multiple space-separated words on a single line, thinking it would bring up entries containing those. I now understand that single terms were expected, and I’ve fixed the file.
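In other words, if I understand correctly, the seeds file should just list one term per line, something like this (made-up examples):

travel
hotel
booking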

Further details:

  • Python 3.6.4 :: Anaconda custom (64-bit)
  • numpy v1.13.3, built with ‘mkl_rt’ and ‘pthread’
  • During the Prodigy installation I noticed that spaCy v2.0.5 was installed, and this was a pip installation, not a conda-forge one. I can see v2.0.4 when I do conda list
  • The text material I’m annotating contains rather long messages (though not all of them are long). I saw the -L long-text mode but haven’t tried it yet.

Thanks in advance!

Just merged the two threads related to this issue. The main difficulty here is figuring out what type of input causes the error, since it only seems to occur in certain cases. So if you come across a suspicious pattern in the data, that would be very helpful! (You can also set PRODIGY_LOGGING=verbose to log the individual tasks that are passing through the application.)