terms.teach: OverflowError: Python int too large to convert to SQLite INTEGER

terms
windows
done
database
solved

(Amy Huntington) #1

OS: Windows 10
Python: 3.7
Dist: Conda
pip installed prodigy without issues

Just got a research trial license yesterday (thanks, btw, can’t wait to show my colleagues at Northwestern!). Installed everything smoothly, default SQLite, etc. Began this training video with work-specific training (https://prodi.gy/docs/video-new-entity-type) and got the following error in conda console:

(spacy_env) C:\Users\ash9984>python -m prodigy terms.teach CAPS_terms en_core_web_lg --seeds "anxiety"
Initialising with 1 seed terms: anxiety

? Starting the web server at http://localhost:8080
Open the app in your browser and start annotating!

08:53:09 - Task queue depth is 1

09:02:19 - Exception when serving /give_answers
Traceback (most recent call last):
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\waitress\channel.py", line 336, in service
    task.service()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\waitress\task.py", line 175, in service
    self.execute()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\waitress\task.py", line 452, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\api.py", line 423, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\falcon\api.py", line 244, in __call__
    responder(req, resp, **params)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 793, in __call__
    raise exception
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 766, in __call__
    self.render_content(self.call_function(input_parameters), context, request, response, **kwargs)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 703, in call_function
    return self.interface(**parameters)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\prodigy\app.py", line 173, in give_answers
    controller.receive_answers(answers, session_id=session_id)
  File "cython_src\prodigy\core.pyx", line 127, in prodigy.core.Controller.receive_answers
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\prodigy\components\db.py", line 303, in add_examples
    content=content)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 4977, in create
    inst.save(force_insert=True)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 5170, in save
    pk_from_cursor = self.insert(**field_dict).execute()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 3584, in execute
    cursor = self._execute()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 2939, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 3830, in execute_sql
    cursor.execute(sql, params or ())
OverflowError: Python int too large to convert to SQLite INTEGER

image

I get this error if I feed in more than 1 training term (eg, “anxiety, depression”). If I use just 1 term, the server connection works, but once I hit 21 session choices, I get the above error in the conda console and the “ERROR: Couldn’t save annotations. make sure the server is running correctly.” message in the training console. I’ve tried to suss out why this is happening in multiple ways (variations of reject, accept, ignore, etc) and 21 always seems to trigger a disconnect from the server. Time doesn’t seem to matter, as I let minutes pass between choices.

I also reinstalled everything in a virtual environment this morning to see if that changed anything but sadly the same error persists.

Thanks in advance!!


(Ines Montani) #2

Hi! Thanks for the detailed report and sorry for the frustration. It looks like some annotation it’s trying to save causes SQLite to fail, which in turn kills the Python process – and as soon as the web app tries to request new questions or send back answers, it notices that the server is gone and complains as well.

OverflowError: Python int too large to convert to SQLite INTEGER

Could you double-check that the Python version you’re running is 64-bit (not 32) and that your environment is on Python 3+?
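One way to check the bitness from inside the environment, as a minimal sketch: the helper name `interpreter_bits` is made up for illustration, and only the standard library is used.

```python
import struct
import sys


def interpreter_bits() -> int:
    """Return the pointer width of the running interpreter, in bits."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


print(sys.version)
print(f"{interpreter_bits()}-bit interpreter")
```

Running this with the Python of the activated environment (here `spacy_env`) shows what Prodigy itself runs on, which may differ from the base install.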


(Amy Huntington) #3

Wow! Thanks for the quick response! No need to apologize, I’m a big fan of y’alls work!!!

Here ya go:
(base) C:\Users\ash9984>python --version
Python 3.6.3 :: Anaconda custom (64-bit)

Virtual:
(spacy_env) C:\Users\ash9984>python --version
Python 3.7.2


(Ines Montani) #4

Thanks and hmmm, this is really really strange. Could you run conda list and check which version of sqlite it has installed by default? Maybe you’ve ended up with some old version with bad defaults compiled into it (which is a known issue).
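As a cross-check alongside conda list, the standard `sqlite3` module reports the version of the SQLite library that the interpreter is actually linked against. This is only a sketch; the helper name `sqlite_library_version` is hypothetical, and the reported version may or may not match the conda package.

```python
import sqlite3


def sqlite_library_version() -> str:
    """Version string of the SQLite library loaded by this interpreter."""
    return sqlite3.sqlite_version


print("SQLite library:", sqlite_library_version())
print("as a tuple:", sqlite3.sqlite_version_info)
```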

To help debug this, could you find line 302 in prodigy/components/db.py and add a print statement above it that outputs the example it’s adding (to find the last one it eventually fails on)? For example, like this:

print(eg)
eg = Example.create(input_hash=eg[INPUT_HASH_ATTR],
                    task_hash=eg[TASK_HASH_ATTR],
                    content=content)

To find the location of your Prodigy installation, you can run the following:

python -c "import prodigy; print(prodigy.__file__)"

Finally, if this is all too annoying and you just want to get started, it might be easier to install MySQL on your system. In your prodigy.json, you can set "db": "mysql" and then use the "db_settings" to specify your username, database and password. See here for details.


(Amy Huntington) #5

conda list produced:

sqlite 3.26.0 he774522_0

Just added print(eg) to line 301 in virtual env version (thanks for the location script, that saved me some time!):

image

Initializing:

(spacy_env) C:\Users\ash9984>python -m prodigy terms.teach CAPS_terms en_core_web_lg --seeds "anxiety"
Initialising with 1 seed terms: anxiety
{'text': 'anxiety', 'answer': 'accept', '_input_hash': 6298237553007272678, '_task_hash': 6567957362502149308}

At 21 choices I get:

16:58:20 - Task queue depth is 1
{'text': 'depression', 'meta': {'score': 0.8066005695}, '_input_hash': 14703391357354852000, '_task_hash': 4519653915177422000, 'answer': 'accept'}
16:59:23 - Exception when serving /give_answers
Traceback (most recent call last):
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\waitress\channel.py", line 336, in service
    task.service()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\waitress\task.py", line 175, in service
    self.execute()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\waitress\task.py", line 452, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\api.py", line 423, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\falcon\api.py", line 244, in __call__
    responder(req, resp, **params)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 793, in __call__
    raise exception
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 766, in __call__
    self.render_content(self.call_function(input_parameters), context, request, response, **kwargs)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 703, in call_function
    return self.interface(**parameters)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\hug\interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\prodigy\app.py", line 173, in give_answers
    controller.receive_answers(answers, session_id=session_id)
  File "cython_src\prodigy\core.pyx", line 127, in prodigy.core.Controller.receive_answers
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\prodigy\components\db.py", line 304, in add_examples
    content=content)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 4977, in create
    inst.save(force_insert=True)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 5170, in save
    pk_from_cursor = self.insert(**field_dict).execute()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 3584, in execute
    cursor = self._execute()
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 2939, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "C:\Users\ash9984\AppData\Local\Continuum\anaconda3\envs\spacy_env\lib\site-packages\peewee.py", line 3830, in execute_sql
    cursor.execute(sql, params or ())
OverflowError: Python int too large to convert to SQLite INTEGER

Hope this helps!


(Matthew Honnibal) #6

Hi @Amy_H,

Thanks for raising this, it definitely looks like a bug. We first thought it was in SQLite, but the hash values printed there sure are bigger than 32bit, so it looks to me like the mmh3 dependency we’re using for this has an error somewhere that’s only being triggered on Windows, possibly only with specific compilation settings.
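For context, a SQLite INTEGER is a signed value of at most 8 bytes, so anything above 2**63 - 1 cannot be stored. The sketch below reproduces only the storage side of the error with the two `_input_hash` values printed above, using an in-memory database. The table name `example` and the helper `try_insert` are hypothetical and are not taken from Prodigy’s schema.

```python
import sqlite3


def try_insert(value: int) -> str:
    """Insert one integer into a throwaway in-memory table and report the outcome."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE example (input_hash INTEGER)")
        conn.execute("INSERT INTO example VALUES (?)", (value,))
        return "ok"
    except OverflowError as err:
        # Raised by the sqlite3 module when the int does not fit in 64 bits.
        return f"OverflowError: {err}"
    finally:
        conn.close()


# Hash of the seed term 'anxiety' from the log: fits in a signed 64-bit integer.
print(try_insert(6298237553007272678))
# Hash logged for 'depression': larger than 2**63 - 1, so the insert fails.
print(try_insert(14703391357354852000))
```

This does not say anything about where the oversized value comes from; it only shows why the database rejects it.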

I had a look at the source for the mmh3_hash function we’re calling, which is here: https://github.com/hajimes/mmh3/blob/master/mmh3module.cpp#L29 . I can’t see the specific error, but it’s not hard to imagine this C code could be hitting a subtle error, perhaps around the sizing of the long or int variables.

In spaCy we’ve been using our own wrapper of MurmurHash: https://github.com/explosion/murmurhash . I think the best solution will be to add the function we need to this, and switch Prodigy over to this library.

The bad news is it’s pretty hard to offer you a mitigation that will let you keep working in the meantime. I’m working on this though — maybe I can figure something out.

As a sanity check as well, could you give the mmh3 version that you’ve got in your conda environment?


(Matthew Honnibal) #7

Okay, I think I have a plan that should let us get a mitigation in place. The plan is:

  1. I get a dev version of murmurhash up on PyPI with a replacement hash function
  2. You check that the replacement hash function doesn’t have the same bug
  3. If the replacement hash function works, then you uninstall mmh3, and we drop in a replacement file that just calls into murmurhash instead.

After we replace the mmh3 module, Prodigy should be none the wiser, and things should work correctly.

Could you try this for me?

python -m pip install murmurhash==1.1.0.dev0
python -c "import mmh3; print(mmh3.hash('anxiety'))"
python -c "import murmurhash; print(murmurhash.hash('anxiety'))"

We’re hoping that the first call (into mmh3) produces the bad number that’s larger than 2**32, while the second call (into murmurhash) produces the correct result.

If this works, the next step is to uninstall mmh3, probably with something like conda uninstall mmh3. We then need to get Prodigy importing a replacement. I think it should be fine to create a file mmh3.py in your working directory, with the following contents:

from murmurhash import hash

Finally, run something like python -m prodigy to see if the replacement module is being imported correctly. If it’s not being found in the working directory, we might need to drop it into the conda virtualenv somewhere.
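To see which file a module name would resolve to without importing it, something like the following sketch may help. The helper name `module_origin` is made up for illustration; it relies only on the standard `importlib.util.find_spec`.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return the path a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If the drop-in replacement is picked up, this should point at the local mmh3.py.
print(module_origin("mmh3"))
```

The lookup follows the same search path as a normal import, so it should be run from the same working directory and environment as Prodigy.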


(Amy Huntington) #8

So sorry for the lag on this- work projects got crazy.

Here’s what happened:

Collecting murmurhash==1.1.0.dev0
  Downloading https://files.pythonhosted.org/packages/c2/77/585d84ef5f0423c0c1d5163bfe68b7b8b6df3ea074963acaa9c36c8eae60/murmurhash-1.1.0.dev0.tar.gz
Building wheels for collected packages: murmurhash
  Running setup.py bdist_wheel for murmurhash ... done
  Stored in directory: C:\Users\ash9984\AppData\Local\pip\Cache\wheels\66\0d\34\ac8bcbf74f7db9130ec060a0000e071a88af7635342b1a2ea6
Successfully built murmurhash
thinc 6.12.1 has requirement murmurhash<1.1.0,>=0.28.0, but you'll have murmurhash 1.1.0.dev0 which is incompatible.
spacy 2.0.16 has requirement murmurhash<1.1.0,>=0.28.0, but you'll have murmurhash 1.1.0.dev0 which is incompatible.
Installing collected packages: murmurhash
  Found existing installation: murmurhash 1.0.1
    Uninstalling murmurhash-1.0.1:
      Successfully uninstalled murmurhash-1.0.1
Successfully installed murmurhash-1.1.0.dev0

(spacy_env) C:\Users\ash9984>python -c "import mmh3; print(mmh3.hash('anxiety'))"
3518314200804635422

(spacy_env) C:\Users\ash9984>python -c "import murmurhash; print(murmurhash.hash('anxiety'))"
-1859125401

Does this look like what you were hoping for/expecting?

Thanks for taking the time Matt (and Ines!). Very impressed with the level of support I’ve already received!


(Amy Huntington) #9

I proceeded with your instructions:

(spacy_env) C:\Users\ash9984>conda uninstall mmh3
Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are missing from the target environment:
  - mmh3

(spacy_env) C:\Users\ash9984>python -m prodigy

  ?  Available recipes:
  ner.match, ner.teach, ner.manual, ner.make-gold, ner.eval, ner.eval-ab,
  ner.batch-train, ner.train-curve, ner.print-best, ner.print-stream,
  ner.print-dataset, ner.gold-to-spacy, ner.iob-to-gold

  textcat.teach, textcat.batch-train, textcat.train-curve, textcat.eval,
  textcat.print-stream, textcat.print-dataset

  dep.teach, dep.batch-train, dep.train-curve, compare, pos.teach,
  pos.make-gold, pos.batch-train, pos.train-curve, pos.gold-to-spacy,
  terms.train-vectors, terms.teach, terms.to-patterns, mark, image.manual,
  image.test


  ?  Available commands:
  dataset, drop, stats, pipe, db-in, db-out


(spacy_env) C:\Users\ash9984>python -m prodigy terms.teach CAPS_terms en_core_web_lg --seeds "anxiety"
Initialising with 1 seed terms: anxiety
{'text': 'anxiety', 'answer': 'accept', '_input_hash': -358848061, '_task_hash': -817917627}

  ?  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

At 21 selections it spits this out:

15:44:50 - Task queue depth is 1
{'text': 'depression', 'meta': {'score': 0.8066005695}, '_input_hash': 1147727864, '_task_hash': -2069453297, 'answer': 'accept'}
{'text': 'panic', 'meta': {'score': 0.7935524238}, '_input_hash': 898628126, '_task_hash': 1230930292, 'answer': 'accept'}
{'text': 'insomnia', 'meta': {'score': 0.7930415441}, '_input_hash': 1265564640, '_task_hash': -1287351673, 'answer': 'accept'}
{'text': 'nervousness', 'meta': {'score': 0.7917400127}, '_input_hash': -1688713714, '_task_hash': 863857025, 'answer': 'accept'}
{'text': 'stress', 'meta': {'score': 0.7867360162}, '_input_hash': -1711672912, '_task_hash': 2074732802, 'answer': 'accept'}
{'text': 'disorder', 'meta': {'score': 0.7840409770000001}, '_input_hash': 1533702439, '_task_hash': -1829597718, 'answer': 'accept'}
{'text': 'symptoms', 'meta': {'score': 0.7823031737}, '_input_hash': -2059339190, '_task_hash': 269617525, 'answer': 'accept'}
{'text': 'irritability', 'meta': {'score': 0.773979784}, '_input_hash': -1395564595, '_task_hash': -2002428720, 'answer': 'accept'}
{'text': 'disorders', 'meta': {'score': 0.7714788198}, '_input_hash': -129257277, '_task_hash': -394331402, 'answer': 'accept'}
{'text': 'pain', 'meta': {'score': 0.7712528458}, '_input_hash': -1599602915, '_task_hash': 1091723846, 'answer': 'accept'}

I think it’s fixed? This is good, right?


(Ines Montani) #10

Yes, if it doesn’t crash anymore, this indicates that the problem is resolved :tada:

(Basically, what went on here was that there’s likely a bug in the hashing library we use that creates the input hash and task hash values for each task saved to the database. The bug is only triggered in super specific conditions and platform combinations and you happened to be the unlucky person to trigger it for the first time ever in over a year :upside_down_face:)


(Amy Huntington) #11

Yeah, that tracks. :wink:

Thanks y’all.


(Rebekah Griesenauer) #12

Greetings, we recently installed prodigy with everything running smoothly. I started adapting this training video (https://prodi.gy/docs/video-insults-classifier) to an application relevant to our company. I got the same “Error: Couldn’t save annotations. Make sure the server is running correctly” error after exactly 21 session choices. I got this error after 21 annotations when using both SQLite and PostgreSQL. I followed the solution given by Matthew involving:

However, we got the same hash number when using both mmh3 and murmurhash:

Therefore, the work around created in this thread is not working for us. Could you please advise?

Thanks in advance,
Rebekah


(Matthew Honnibal) #13

@reb-greazy Hmm, that’s confusing!

Could you print the tasks which are failing, so we can see what the text and its hash is? We want to make sure there’s a hash that’s greater than 32bits there, and then verify what the hash value is when we use mmh3. If we’re not getting the same value by calling the library directly, that’s very confusing, and we’ll know where to look. If we do get a value greater than 32bits out of mmh3 and we also get that value out of murmurhash, then I’ll need to push a fix to that dev version of murmurhash.


(Rebekah Griesenauer) #14

Here is the call and initialization:

After 21 annotations, I get:

I hope this is helpful. Thanks in advance.


(Ines Montani) #15

Thanks for helping us debug this! :pray: Could you add the print statement to the recipe as discussed here?

This will give us the exact term it fails on when it’s trying to add it to the database.


(Rebekah Griesenauer) #16

Ines,

I added the print statement as instructed:
image

After additional testing, I noticed that I will get the same error (see below) every time I try saving from the web-application (even after just one annotation).

Here is the basic logging from my terminal:

Please let me know what additional info I can provide. Thanks so much for your help with this!


(Ines Montani) #17

Thanks so much! One thing that’s not 100% clear from your screenshot yet: What’s the last example it prints before the error occurs and the process dies? This should also be something like {'text': 'something', 'answer': '...'}. This will help us debug, because that example is the culprit.

This makes sense because once the server has died, all connections the app is trying to make fail.

The underlying error happens when Prodigy tries to save the example to the database. This doesn’t happen instantly – the app usually waits until it has one full batch of answers ready (minus the history, which is kept in the app so you can quickly undo). It then sends it out. A batch consists of 10 examples, so on the 21st annotation, Prodigy has one full batch of 10 answers plus 10 in the history. It sends the 10 back to the database and, bam, the error happens. This likely explains the “magic number” of 21.
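The arithmetic can be sketched with a toy model. This is an assumption chosen to match the description above, not Prodigy’s actual front-end logic: answers pile up in the app, and a batch is sent as soon as more than one batch plus the history is pending. The function name `flush_points` and its parameters are hypothetical.

```python
from typing import List


def flush_points(total_answers: int, batch_size: int = 10, history_size: int = 10) -> List[int]:
    """Return the annotation counts at which a batch would be sent to the server."""
    pending = 0
    points = []
    for n in range(1, total_answers + 1):
        pending += 1
        # Send the oldest batch once more than batch + history answers are waiting.
        if pending > batch_size + history_size:
            points.append(n)
            pending -= batch_size
    return points


print(flush_points(45))
```

Under this model the first save is attempted on the 21st answer and then on every 10th answer after that.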


(Rebekah Griesenauer) #18

Ines, sorry I am still pretty new at working with prodigy. I understand what you are saying regarding needing to know the last example. However, I added the print statement to the recipe but it seems to only print that information after the batch of 10 or right in the beginning as it is initializing the seeds. It continues to print in batches of 10 even when I am getting the saving error in the web application. However, I can find the 21st example by searching for the word it stopped on this session:

The 21st example in this session was the word “assumption” : {'text': 'assumption', 'meta': {'score': 0.7629092975159285}, '_input_hash': -1957380416, '_task_hash': -1661728975}


(Rebekah Griesenauer) #19

@ines Can I provide any additional information to help debug this problem?

Thanks!


(Ines Montani) #20

Sorry, think I forgot to reply to your previous comment!

Yeah, that makes sense, because the answers are also sent back in batches of 10. It loops over each example and adds it to the database – and on one of them, the SQLite database will eventually complain and the whole thing will fail (this doesn’t have to be the 21st example – it can be any example in the previous 10).
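Since any example in the batch can be the one that fails, one way to narrow it down is to scan the printed examples for hash values that fall outside the signed 64-bit range SQLite can store. This is only a sketch: the helper name `out_of_range_examples` is hypothetical, and the two sample dicts are copied from output posted earlier in this thread.

```python
from typing import Any, Dict, Iterable, List

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
HASH_KEYS = ("_input_hash", "_task_hash")


def out_of_range_examples(batch: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the examples whose hash fields do not fit in a signed 64-bit integer."""
    bad = []
    for eg in batch:
        # A missing hash key is treated as 0, which is always in range.
        if any(not (INT64_MIN <= eg.get(key, 0) <= INT64_MAX) for key in HASH_KEYS):
            bad.append(eg)
    return bad


batch = [
    {"text": "depression", "_input_hash": 14703391357354852000,
     "_task_hash": 4519653915177422000, "answer": "accept"},
    {"text": "assumption", "_input_hash": -1957380416, "_task_hash": -1661728975},
]
for eg in out_of_range_examples(batch):
    print(eg["text"], eg["_input_hash"], eg["_task_hash"])
```

If nothing in a failing batch is flagged, the overflow is presumably coming from a different field, and the full traceback is the better place to look.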

So just to confirm, is the last example you see before the error occurs the “assumption” example you posted?