terms.teach: OverflowError: Python int too large to convert to SQLite INTEGER

Hi @Amy_H,

Thanks for raising this, it definitely looks like a bug. We first thought it was in SQLite, but the hash values printed there sure are bigger than 32 bits, so it looks to me like the mmh3 dependency we’re using for this has an error somewhere that’s only being triggered on Windows, possibly only with specific compilation settings.

I had a look at the source for the mmh3_hash function we’re calling, which is here: https://github.com/hajimes/mmh3/blob/master/mmh3module.cpp#L29 . I can’t see the specific error, but it’s not hard to imagine this C code could be hitting a subtle error, perhaps around the sizing of the long or int variables.

In spaCy we’ve been using our own wrapper of MurmurHash: https://github.com/explosion/murmurhash . I think the best solution will be to add the function we need to this, and switch Prodigy over to this library.

The bad news is it’s pretty hard to offer you a mitigation that will let you keep working in the meantime. I’m working on this though — maybe I can figure something out.

As a sanity check as well, could you give the mmh3 version that you’ve got in your conda environment?

Okay, I think I have a plan that should let us get a mitigation in place. The plan is:

  1. I get a dev version of murmurhash up on PyPI with a replacement hash function
  2. You check that the replacement hash function doesn’t have the same bug
  3. If the replacement hash function works, then you uninstall mmh3, and we drop in a replacement file that just calls into murmurhash instead.

After we replace the mmh3 module, Prodigy should be none the wiser, and things should work correctly.

Could you try this for me?

python -m pip install murmurhash==1.1.0.dev0
python -c "import mmh3; print(mmh3.hash('anxiety'))"
python -c "import murmurhash; print(murmurhash.hash('anxiety'))"

We’re hoping that the first call (into mmh3) produces the bad number that’s larger than 2**32, while the second call (into murmurhash) produces the correct result.

If this works, the next step is to uninstall mmh3, probably with something like conda uninstall mmh3. We then need to get Prodigy importing a replacement. I think it should be fine to create a file mmh3.py in your working directory, with the following contents:

from murmurhash import hash

Finally, run something like python -m prodigy to see if the replacement module is being imported correctly. If it’s not being found on the working directory, we might need to drop it into the conda virtualenv somewhere.
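
To see which file Python would pick up for a module name without actually importing it, something like the following may help. It is a minimal sketch using only the standard library; `locate_module` is a hypothetical helper name, not part of Prodigy, mmh3 or murmurhash.

```python
import importlib.util


def locate_module(name):
    """Return the path Python would import the top-level module `name` from.

    Returns None if no module with that name can be found on the import path.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# If the replacement file is being found, this should point at your own
# mmh3.py rather than at a compiled mmh3 extension in site-packages.
print(locate_module("mmh3"))
```

If this prints `None`, neither the original package nor the replacement file is visible from the directory you ran it in.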

So sorry for the lag on this. Work projects got crazy.

Here’s what happened:

Collecting murmurhash==1.1.0.dev0
  Downloading https://files.pythonhosted.org/packages/c2/77/585d84ef5f0423c0c1d5163bfe68b7b8b6df3ea074963acaa9c36c8eae60/murmurhash-1.1.0.dev0.tar.gz
Building wheels for collected packages: murmurhash
  Running setup.py bdist_wheel for murmurhash ... done
  Stored in directory: C:\Users\ash9984\AppData\Local\pip\Cache\wheels\66\0d\34\ac8bcbf74f7db9130ec060a0000e071a88af7635342b1a2ea6
Successfully built murmurhash
thinc 6.12.1 has requirement murmurhash<1.1.0,>=0.28.0, but you'll have murmurhash 1.1.0.dev0 which is incompatible.
spacy 2.0.16 has requirement murmurhash<1.1.0,>=0.28.0, but you'll have murmurhash 1.1.0.dev0 which is incompatible.
Installing collected packages: murmurhash
  Found existing installation: murmurhash 1.0.1
    Uninstalling murmurhash-1.0.1:
      Successfully uninstalled murmurhash-1.0.1
Successfully installed murmurhash-1.1.0.dev0

(spacy_env) C:\Users\ash9984>python -c "import mmh3; print(mmh3.hash('anxiety'))"
3518314200804635422

(spacy_env) C:\Users\ash9984>python -c "import murmurhash; print(murmurhash.hash('anxiety'))"
-1859125401

Does this look like what you were hoping for/expecting?

Thanks for taking the time Matt (and Ines!). Very impressed with the level of support I’ve already received!

I proceeded with your instructions:

(spacy_env) C:\Users\ash9984>conda uninstall mmh3
Collecting package metadata: done
Solving environment: failed

PackagesNotFoundError: The following packages are missing from the target environment:
  - mmh3

(spacy_env) C:\Users\ash9984>python -m prodigy

  ?  Available recipes:
  ner.match, ner.teach, ner.manual, ner.make-gold, ner.eval, ner.eval-ab,
  ner.batch-train, ner.train-curve, ner.print-best, ner.print-stream,
  ner.print-dataset, ner.gold-to-spacy, ner.iob-to-gold

  textcat.teach, textcat.batch-train, textcat.train-curve, textcat.eval,
  textcat.print-stream, textcat.print-dataset

  dep.teach, dep.batch-train, dep.train-curve, compare, pos.teach,
  pos.make-gold, pos.batch-train, pos.train-curve, pos.gold-to-spacy,
  terms.train-vectors, terms.teach, terms.to-patterns, mark, image.manual,
  image.test


  ?  Available commands:
  dataset, drop, stats, pipe, db-in, db-out


(spacy_env) C:\Users\ash9984>python -m prodigy terms.teach CAPS_terms en_core_web_lg --seeds "anxiety"
Initialising with 1 seed terms: anxiety
{'text': 'anxiety', 'answer': 'accept', '_input_hash': -358848061, '_task_hash': -817917627}

  ?  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

At 21 selections it spits this out:

15:44:50 - Task queue depth is 1
{'text': 'depression', 'meta': {'score': 0.8066005695}, '_input_hash': 1147727864, '_task_hash': -2069453297, 'answer': 'accept'}
{'text': 'panic', 'meta': {'score': 0.7935524238}, '_input_hash': 898628126, '_task_hash': 1230930292, 'answer': 'accept'}
{'text': 'insomnia', 'meta': {'score': 0.7930415441}, '_input_hash': 1265564640, '_task_hash': -1287351673, 'answer': 'accept'}
{'text': 'nervousness', 'meta': {'score': 0.7917400127}, '_input_hash': -1688713714, '_task_hash': 863857025, 'answer': 'accept'}
{'text': 'stress', 'meta': {'score': 0.7867360162}, '_input_hash': -1711672912, '_task_hash': 2074732802, 'answer': 'accept'}
{'text': 'disorder', 'meta': {'score': 0.7840409770000001}, '_input_hash': 1533702439, '_task_hash': -1829597718, 'answer': 'accept'}
{'text': 'symptoms', 'meta': {'score': 0.7823031737}, '_input_hash': -2059339190, '_task_hash': 269617525, 'answer': 'accept'}
{'text': 'irritability', 'meta': {'score': 0.773979784}, '_input_hash': -1395564595, '_task_hash': -2002428720, 'answer': 'accept'}
{'text': 'disorders', 'meta': {'score': 0.7714788198}, '_input_hash': -129257277, '_task_hash': -394331402, 'answer': 'accept'}
{'text': 'pain', 'meta': {'score': 0.7712528458}, '_input_hash': -1599602915, '_task_hash': 1091723846, 'answer': 'accept'}

I think it’s fixed? This is good, right?

Yes, if it doesn’t crash anymore, this indicates that the problem is resolved :tada:

(Basically, what went on here was that there’s likely a bug in the hashing library we use that creates the input hash and task hash values for each task saved to the database. The bug is only triggered in super specific conditions and platform combinations and you happened to be the unlucky person to trigger it for the first time ever in over a year :upside_down_face:)

Yeah, that tracks. :wink:

Thanks y’all.


Greetings, we recently installed prodigy with everything running smoothly. I started adapting this training video (https://prodi.gy/docs/video-insults-classifier) to an application relevant to our company. I got the same “Error: Couldn’t save annotations. Make sure the server is running correctly” error after exactly 21 session choices. I got this error after 21 annotations when using both SQLite and PostgreSQL. I followed the solution given by Matthew involving:

However, we got the same hash number when using both mmh3 and murmurhash:

Therefore, the work around created in this thread is not working for us. Could you please advise?

Thanks in advance,
Rebekah

@reb-greazy Hmm, that’s confusing!

Could you print the tasks which are failing, so we can see what the text and its hash are? We want to make sure there’s a hash that’s greater than 32 bits there, and then verify what the hash value is when we use mmh3. If we’re not getting the same value by calling the library directly, that’s very confusing, and we’ll know where to look. If we do get a value greater than 32 bits out of mmh3 and we also get that value out of murmurhash, then I’ll need to push a fix to that dev version of murmurhash.

Here is the call and initialization:

After 21 annotations, I get:

I hope this is helpful. Thanks in advance.

Thanks for helping us debug this! :pray: Could you add the print statement to the recipe as discussed here?

This will give us the exact term it fails on when it’s trying to add it to the database.

Ines,

I added the print statement as instructed:
[screenshot]

After additional testing, I noticed that I will get the same error (see below) every time I try saving from the web-application (even after just one annotation).

Here is the basic logging from my terminal:

Please let me know what additional info I can provide. Thanks so much for your help with this!

Thanks so much! One thing that’s not 100% clear from your screenshot yet: What’s the last example it prints before the error occurs and the process dies? This should also be something like {'text': 'something', 'answer': '...'}. This will help us debug, because that example is the culprit.

This makes sense because once the server has died, all connections the app is trying to make fail.

The underlying error happens when Prodigy tries to save the example to the database. This doesn’t happen instantly – the app usually waits until it has one full batch of answers ready (minus the history, which is kept in the app so you can quickly undo). It then sends it out. A batch consists of 10 examples, so on the 21st annotation, Prodigy has one full batch of 10 answers plus 10 history. It sends the 10 back to the database and, bam, the error happens. This likely explains the “magic number” of 21.
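
As a rough illustration of that arithmetic, here is a toy model in Python. It is not Prodigy's actual code: the function name and the exact trigger condition (send once the answers held in the app exceed one batch plus the history) are assumptions chosen only to match the description above.

```python
def first_send_at(batch_size=10, history_size=10):
    """Toy model: return (annotation count at first send, answers sent)."""
    pending = []  # answers still held in the app, oldest first
    count = 0
    while True:
        count += 1
        pending.append(count)
        # Assumed rule: send the oldest batch only once a full batch
        # exists on top of the history kept back for undo.
        if len(pending) > batch_size + history_size:
            return count, pending[:batch_size]


count, sent = first_send_at()
print(count)  # annotation number that triggers the first save
print(sent)   # the answers in that first batch
```

Under these assumptions the first save is triggered by annotation 21 and contains answers 1 to 10, so the failing example could be any of those ten.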

Ines, sorry I am still pretty new at working with prodigy. I understand what you are saying regarding needing to know the last example. However, I added the print statement to the recipe but it seems to only print that information after the batch of 10 or right in the beginning as it is initializing the seeds. It continues to print in batches of 10 even when I am getting the saving error in the web application. However, I can find the 21st example by searching for the word it stopped on this session:

The 21st example in this session was the word “assumption”: {'text': 'assumption', 'meta': {'score': 0.7629092975159285}, '_input_hash': -1957380416, '_task_hash': -1661728975}

@ines Can I provide any additional information to help debug this problem?

Thanks!

Sorry, think I forgot to reply to your previous comment!

Yeah, that makes sense, because the answers are also sent back in batches of 10. It loops over each example and adds it to the database – and on one of them, the SQLite database will eventually complain and the whole thing will fail (this doesn’t have to be the 21st example – it can be any example in the previous 10).

So just to confirm, is the last example you see before the error occurs the “assumption” example you posted?

That is correct. Assumption is what it failed on.

This is confusing though – it looks to me like the hash values in the assumption example above are all valid 32 bit integers. We’re looking for values outside the range −2,147,483,648 to 2,147,483,647, which I don’t see in the examples you’ve posted?
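
For reference, the range check can be done mechanically. This is a small sketch; `fits_int32` is a hypothetical helper name, and the values are copied from posts in this thread (the two hashes from the “assumption” example, and the mmh3 output from the original Windows report).

```python
INT32_MIN = -2**31      # -2,147,483,648
INT32_MAX = 2**31 - 1   #  2,147,483,647


def fits_int32(value):
    """Return True if `value` is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


hashes = {
    "assumption _input_hash": -1957380416,
    "assumption _task_hash": -1661728975,
    "mmh3.hash('anxiety') from the original report": 3518314200804635422,
}

for label, value in hashes.items():
    print(f"{label}: {value} fits in 32 bits: {fits_int32(value)}")
```

Only the value from the original report falls outside the range, which is why the “assumption” example doesn’t look like the same bug.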

I think we’ve taken a wrong turn in our debugging here, and we’re looking at a different problem that’s surfaced as a similar error message. I think the problem isn’t the hash values — it’s something else.

Perhaps try setting PRODIGY_LOGGING=verbose? We need to backtrack and look for a different problem. Sorry for the false start! Perhaps also paste the output of pip list so we can look at the versions?

I notice the error happened with both sqlite and postgres, which suggests it’s not the DB version. I’m not sure what might be wrong.

I have been paying attention to the hash values and have not noticed any larger than a valid 32 bit integer. Here is my pip list:


Here is an example of a prodigy session. I annotated 22 examples and then stopped the session.

@reb-greazy I think that’s the pip list for your system Python – it doesn’t have Prodigy installed, so it looks like you might have forgotten to activate your virtualenv?

The other relevant platform details would be the OS you’re using, and the version of sqlite you have on your system.
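
One way to collect those details in one go is a short script like the one below, run from inside the activated environment. It is only a sketch using the standard library; `environment_report` is a hypothetical helper name. Note that `sqlite3.sqlite_version` is the version of the SQLite library that Python itself is linked against, which may differ from a separately installed `sqlite3` command-line tool.

```python
import platform
import sqlite3
import sys


def environment_report():
    """Collect OS, Python and SQLite library versions as strings."""
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "sqlite_library": sqlite3.sqlite_version,
    }


for key, value in environment_report().items():
    print(f"{key}: {value}")
```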

Finally, the saving error you’re seeing: does it happen if you just annotate one example and then hit save (either click the disk icon or hit control+s)?

I wonder whether there’s some reason it can’t write to the DB, like a permissions error that’s getting swallowed. I think running out of disk space would raise something, but you could check that as well just for sanity?
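
A quick way to rule both of those out is to try creating a file in the directory that holds the database and to look at the free space there. The sketch below uses only the standard library; `check_directory` is a hypothetical helper, and the directory you pass in is an assumption on my part (use whichever folder your database file actually lives in). The example call uses the system temp directory purely so that it runs anywhere.

```python
import shutil
import tempfile


def check_directory(directory):
    """Return (writable, free_bytes) for an existing `directory`.

    `writable` is determined by actually creating a temporary file there,
    which is removed again immediately.
    """
    try:
        with tempfile.TemporaryFile(dir=directory):
            writable = True
    except OSError:
        writable = False
    free_bytes = shutil.disk_usage(directory).free
    return writable, free_bytes


# Replace the argument with the directory containing your database file.
writable, free_bytes = check_directory(tempfile.gettempdir())
print(f"writable: {writable}, free space: {free_bytes / 1e9:.1f} GB")
```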

Sorry this is still so unclear. I really don’t know what might be wrong.

Regarding @reb-greazy’s issue, we fixed it by adding a proxypass to the /give-answers location. Please see the full description here:

Thank you to @honnibal and @ines for all their help on this!