Crashing trying to load a new NER model.

I trained a new model, and loading it crashes spacy.load() inside regex.compile.

The pattern at fault is:

^§|^%|^=|^+(?![0-9])|^…|^……|^,|^:|^;|^!|^?|^¿|^؟|^¡|^(|^)|^[|^]|^{|^}|^<|^>|^_|^#|^*|^&|^。|^?|^!|^,|^、|^;|^:|^~|^·|^।|^،|^؛|^٪|^..+|^…|^’|^"|^”|^“|^`|^‘|^´|^’|^‚|^,|^„|^»|^«|^「|^」|^『|^』|^(|^)|^〔|^〕|^【|^】|^《|^》|^〈|^〉|^$|^£|^€|^¥|^฿|^US$|^C$|^A$|^₽|^﷼|^₴|^[¦©®°҂֍֎؎؏۞۩۽۾߶৺୰௳-௸௺౿൏൹༁-༃༓༕-༗༚-༟༴༶༸྾-࿅࿇-࿌࿎࿏࿕-࿘႞႟᎐-᎙᥀᧞-᧿᭡-᭪᭴-᭼℀℁℃-℆℈℉℔№℗℞-℣℥℧℩℮℺℻⅊⅌⅍⅏↊↋:arrow_up_down:-:arrow_lower_left:↜-↟↡↢↤↥↧-↭↯-⇍⇐⇑⇓⇕-⇳⌀-⌇⌌-⌟⌢-:keyboard:⌫-⍻⍽-⎚⎴-⏛⏢-␦⑀-⑊⒜-ⓩ─-:arrow_forward:▸-:arrow_backward:◂-◷:sunny:-♮♰-❧➔-:loop:⠀-⣿⬀-⬯⭅⭆⭍-⭳⭶-⮕⮘-⯈⯊-⯾⳥-⳪⺀-⺙⺛-⻳⼀-⿕⿰-⿻〄〒〓〠〶〷〾〿㆐㆑㆖-㆟㇀-㇣㈀-㈞㈪-㉇㉐㉠-㉿㊊-㊰㋀-㋾㌀-㏿䷀-䷿꒐-꓆꠨-꠫꠶꠷꠹꩷-꩹﷽¦│■○�𐄷-𐄿𐅹-𐆉𐆌-𐆎𐆐-𐆛𐆠𐇐-𐇼𐡷𐡸𐫈𑜿𖬼-𖬿𖭅𛲜𝀀-𝃵𝄀-𝄦𝄩-𝅘𝅥𝅲𝅪-𝅬𝆃𝆄𝆌-𝆩𝆮-𝇨𝈀-𝉁𝉅𝌀-𝍖𝠀-𝧿𝨷-𝨺𝩭-𝩴𝩶-𝪃𝪅𝪆𞲬🀀-🀫🀰-🂓🂠-🂮🂱-🂿🃁-:black_joker:🃑-🃵🄐-🅫:a:-🆬🇦-:sa:🈐-🈻🉀-🉈:ideograph_advantage::accept:🉠-🉥:cyclone:-:amphora::rat:-🛔:hammer_and_wrench:-:flight_arrival::artificial_satellite:-:skateboard:🜀-🝳🞀-🟘🠀-🠋🠐-🡇🡐-🡙🡠-🢇🢐-🢭🤀-🤋:zipper_mouth_face:-🤾:wilted_flower:-:smiling_face_with_three_hearts::partying_face:-:cold_face::pleading_face::lab_coat:-:swan:🦰-:supervillain::cheese:-:salt::face_with_monocle:-:nazar_amulet:🩠-🩭]

and the error message is: _regex_core.error: bad character range at position 522

The model was built in Prodigy (Mac, Python 3.6) and is being loaded on a Mac with Python 2.7, where my production spaCy code lives.

Thoughts?

– Sean

Hi! This sounds like the model was trained with a different version of spaCy. In spaCy v2.1, we made various improvements to the tokenization, which resulted in a 2-3x speedup. But it also means that models aren’t compatible between spaCy v2.0.x and v2.1.x. So you’d either have to retrain your model, or upgrade/downgrade spaCy. (You can run python -m spacy info in both your Prodigy and production environments to find out which versions you’re running where.)
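As a rough illustration of the compatibility rule described above, the sketch below compares two version strings by their major.minor series. The helper name same_model_series is hypothetical (it is not part of spaCy), the version strings are example values only, and the assumption that compatibility follows the first two version components comes solely from the v2.0.x / v2.1.x statement above.

```python
def same_model_series(version_a, version_b):
    """Return True if both version strings share the same major.minor series.

    Hypothetical helper for illustration; this is not a spaCy API.
    """
    series_a = tuple(version_a.split(".")[:2])
    series_b = tuple(version_b.split(".")[:2])
    return series_a == series_b


# Example values only: one v2.1.x release and one v2.0.x release.
print(same_model_series("2.1.0", "2.0.0"))
```

The version strings to feed into such a check would come from running python -m spacy info in each environment.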

Ah hah. Yup. Prodigy has 2.1.4, my production environment has 2.0.18.

But updating leads to

SystemError: [E130] You are running a narrow unicode build, which is incompatible with spacy >= 2.1.0. To fix this, reinstall Python and use a wide unicode build instead. You can also rebuild Python and set the --enable-unicode=ucs4 flag.

which is also an issue in production.

Are you able to install a wide-unicode build of Python, as described in the error message?
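To confirm which kind of build an interpreter is, sys.maxunicode can be inspected: it is 0xFFFF on a narrow build and 0x10FFFF on a wide build. The sketch below wraps that comparison in a small function; the name is_narrow_build is an assumption made for this example and is not part of spaCy.

```python
import sys

# Largest code point a narrow (UCS-2) build can store in a single unit.
NARROW_MAX = 0xFFFF


def is_narrow_build(maxunicode=None):
    """Return True if the given sys.maxunicode value indicates a narrow build.

    Hypothetical helper; defaults to the running interpreter's value.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return maxunicode <= NARROW_MAX


print("narrow build" if is_narrow_build() else "wide build")
```

Running this in the production Python 2.7 environment should show whether a rebuild with --enable-unicode=ucs4 is needed at all.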

It’s a major change to the production stack, and the cost is not insignificant.

I suspect that the change to UCS4 is part of the speed improvements you got for tokenization. If it’s possible, an alternate build for narrow unicode would be helpful. That’s mostly shifting my testing cost upstream, but it might be shared across other users.

This is probably more significant on Mac, where we like to have a “framework” build and that makes it even more complicated to build for wide unicode.

@sean.true Actually spaCy hasn’t worked properly with a narrow unicode build for quite some time. It’s just that previously the errors would occur as wide unicode characters were processed, so you’d get unexpected results when parsing things like unicode characters.
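One plausible mechanism for the "bad character range" error, offered here as an assumption and not something confirmed in this thread: on a narrow build, a character above U+FFFF is stored as two surrogate code units, so a character-class range between two such characters is read as a range that starts at a low surrogate and ends at a high surrogate, which is backwards. The sketch below simulates this on a wide build with the standard re module by rewriting a pattern as one character per UTF-16 code unit. The helper names are hypothetical, and the range U+10137 to U+1013F is used because it appears to be one of the ranges in the pattern above.

```python
import re
import struct


def as_utf16_code_units(text):
    """Re-express text as one character per UTF-16 code unit,
    approximating how a narrow build stores the string."""
    data = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return "".join(chr(unit) for unit in units)


def compiles(pattern):
    """Return True if re.compile accepts the pattern, False on re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


wide_pattern = "[\U00010137-\U0001013F]"
narrow_pattern = as_utf16_code_units(wide_pattern)

print(compiles(wide_pattern))
print(compiles(narrow_pattern))
```

In the simulated narrow form, the class contains the units D800, DD37, "-", D800, DD3F, so the range runs from DD37 down to D800 and the compiler rejects it.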

If it’s really necessary, you should be able to achieve the same behaviour you had previously using the following steps:

  • Add a requirement for regex==2018.01.10
  • Prior to importing spaCy, monkey-patch the re module such that the re.compile function is replaced by regex.compile.
  • Prior to importing spaCy, set sys.maxunicode = 0 to defeat the diagnostic check in spaCy’s __init__.py.

The following code is untested, but I think it should work:


import regex
import re
import sys

# Monkey-patch the re module, so that spaCy compiles its patterns with regex.compile
_compile = re.compile
re.compile = regex.compile
# Defeat diagnostic sys.maxunicode check in spacy's __init__.py
maxunicode = sys.maxunicode
sys.maxunicode = 0

import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("Hello world")

# Undo the monkeypatching
re.compile = _compile
sys.maxunicode = maxunicode

# Verify that text processing still works
doc = nlp("Hello world")
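The patch-and-restore steps in the snippet above could also be wrapped in a context manager, so the original attributes are restored even if the import or model load raises. This is a minimal sketch using only the standard library; temporarily_patched and fake_compile are hypothetical names introduced for the example, and fake_compile merely stands in for regex.compile so the sketch runs without third-party packages.

```python
import contextlib
import re


@contextlib.contextmanager
def temporarily_patched(obj, name, value):
    """Replace obj.<name> with value for the duration of the with-block,
    restoring the original attribute afterwards, even on error."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield original
    finally:
        setattr(obj, name, original)


def fake_compile(pattern, flags=0):
    """Stand-in for regex.compile, used only to show the patch is active."""
    return ("fake", pattern)


with temporarily_patched(re, "compile", fake_compile):
    inside = re.compile("abc")   # goes through the stand-in

outside = re.compile("abc")      # the real re.compile is back
```

For the scenario in this thread, the same helper would presumably be applied twice, once to re with regex.compile and once to sys with maxunicode set to 0, with the spaCy import and spacy.load call placed inside both with-blocks. That combination is untested here.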

Of course, this isn’t a supported workflow — but I do expect you’ll be able to make something like this work to solve your immediate problem. I would definitely suggest just installing the wide unicode runtime, though.

Oh, what a lovely hack. Thank you!