Hi! This sounds like the model was trained with a different version of spaCy. In spaCy v2.1, we made various improvements to the tokenization, which resulted in a 2-3x speedup. But it also means that models aren’t compatible between spaCy v2.0.x and v2.1.x. So you’d either have to retrain your model, or upgrade/downgrade spaCy. (You can run python -m spacy info in both your Prodigy and production environments to find out which versions you’re running where.)
Ah hah. Yup. Prodigy has 2.1.4, my production environment has 2.0.18.
But updating leads to:
SystemError: [E130] You are running a narrow unicode build, which is incompatible with spacy >= 2.1.0. To fix this, reinstall Python and use a wide unicode build instead. You can also rebuild Python and set the --enable-unicode=ucs4 flag.
It’s a major change to the production stack, and the cost is not insignificant.
I suspect that the change to UCS4 is part of the speed improvements you got for tokenization. If it’s possible, an alternate build for narrow unicode would be helpful. That’s mostly shifting my testing cost upstream, but that cost would then be shared across other users.
This is probably more significant on Mac, where we like to have a “framework” build and that makes it even more complicated to build for wide unicode.
@sean.true Actually, spaCy hasn’t worked properly with a narrow unicode build for quite some time. It’s just that previously the errors would only occur once wide unicode characters were processed, so you’d get unexpected results when parsing text that contained them.
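To confirm which kind of build an environment is running, you can look at sys.maxunicode: narrow builds report 0xFFFF and wide builds report 0x10FFFF. The following is only a minimal sketch; the helper name is_narrow_unicode_build is made up for illustration and is not part of spaCy.

```python
import sys

# Narrow builds cap sys.maxunicode at 0xFFFF; wide builds report 0x10FFFF.
NARROW_MAX = 0xFFFF


def is_narrow_unicode_build(maxunicode=None):
    """Return True if the given (or current) maxunicode indicates a narrow build.

    Hypothetical helper for illustration only.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode <= NARROW_MAX


print("sys.maxunicode = %#x, narrow build: %s"
      % (sys.maxunicode, is_narrow_unicode_build()))
```

Running this in both the Prodigy and the production environment should show whether only one of them is a narrow build.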
If it’s really necessary, you should be able to achieve the same behaviour you had previously using the following steps:
Add a requirement for regex==2018.01.10
Prior to importing spaCy, monkey-patch the re module such that the re.compile function is replaced by regex.compile.
Prior to importing spaCy, set sys.maxunicode = 0 to defeat the diagnostic check in spaCy’s __init__.py.
The following code is untested, but I think it should work:
import regex
import re
import sys
# Monkey-patch the re module, so that spaCy compiles its patterns with regex.compile
_compile = re.compile
re.compile = regex.compile
# Defeat diagnostic sys.maxunicode check in spacy's __init__.py
maxunicode = sys.maxunicode
sys.maxunicode = 0
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("Hello world")
# Undo the monkeypatching
re.compile = _compile
sys.maxunicode = maxunicode
# Verify that text processing still works
doc = nlp("Hello world")
Of course, this isn’t a supported workflow — but I do expect you’ll be able to make something like this work to solve your immediate problem. I would definitely suggest just installing the wide unicode runtime, though.
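If you do go the monkey-patching route, one possible refinement is to wrap each temporary change in a context manager, so the original values are restored even if loading the model raises. This is a sketch under the same caveats as above: temporarily_set is a hypothetical helper, and it has not been tested against spaCy.

```python
import contextlib


@contextlib.contextmanager
def temporarily_set(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore the original.

    Hypothetical helper for illustration; not part of spaCy or regex.
    """
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so the patch never leaks.
        setattr(obj, name, original)


# Intended usage (assumes regex and spaCy are installed; untested):
#
#     import re, sys, regex
#     with temporarily_set(re, "compile", regex.compile), \
#             temporarily_set(sys, "maxunicode", 0):
#         import spacy
#         nlp = spacy.load("en_core_web_md")
```

The behaviour is meant to match the manual patch-and-undo code above; the only difference is that the cleanup sits in a finally block.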