Model that works in Jupyter performs poorly in other environments

We have a model that works reasonably well in a Jupyter notebook (a reasonable starting point for improvement).
But the same model, with the same input, performs really badly when we use it outside of Jupyter.

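import spacy

# result contains text; model is the pipeline passed to spacy.load

def makedoc(result, model):
    # Load the pipeline, then run it over the text
    nlp = spacy.load(model)
    doc = nlp(result)
    return doc

doc = makedoc(result, model)

# Print the label and text of every entity the NER component found
print("entities for ner")
for ent in doc.ents:
    print(ent.label_, ent.text)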

Different people on different machines (Windows and Mac) have successfully used the model in Jupyter (Jupyter Notebook, in VS Code, in PyCharm). Jupyter chose the spaCy version, and it has worked with spaCy 3.6.1 and 3.7.2.

When we set up a Poetry environment for easier experimentation, everything runs (annotating, training, extracting text from PDFs, using the model), but the returned entities are much, much worse (not a good starting point for model development).

I have tried quite a few different spaCy and Python versions in case there was some incompatibility. In general spaCy is a later version, e.g. 3.7.4, but we even pinned the spaCy version back to 3.6.1 and got the same poor results.

In case it was a Poetry problem, I just ran the code in a .py file (interpreter was ~/miniconda3/lib/python3.9, spaCy was 3.7.2).
We got the same poor results as when using Poetry.

Question 1 – are there any known incompatibilities between spaCy versions and Python versions? (I understand the numpy 2.0.0 problem.)
Question 2 – are there issues with a miniconda environment?
Question 3 – are there any known incompatibilities with Poetry?
Question 4 – is there something else missing that Jupyter would provide, e.g. any spaCy dependencies I should have included?

Any other thoughts on why I get better results in a jupyter notebook or anything else I might have forgotten to control for?
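One thing worth controlling for is whether both environments load the same pipeline at all. The following is a minimal sketch, not the poster's code: `summarize_pipeline` is a hypothetical helper that reads the `meta` dict and `pipe_names` list that a loaded spaCy pipeline exposes, and the stand-in object at the bottom exists only so the sketch runs without spaCy installed.

```python
from types import SimpleNamespace


def summarize_pipeline(nlp):
    """Collect the facts that identify which pipeline was actually loaded."""
    meta = nlp.meta  # spaCy keeps the pipeline's metadata in this dict
    return {
        "name": f"{meta.get('lang', '?')}_{meta.get('name', '?')}",
        "version": meta.get("version", "?"),
        "trained_with_spacy": meta.get("spacy_version", "?"),
        "components": list(nlp.pipe_names),
    }


if __name__ == "__main__":
    # Hypothetical stand-in values; in real use pass spacy.load(model) instead.
    stand_in = SimpleNamespace(
        meta={"lang": "en", "name": "demo_pipeline", "version": "0.0.0"},
        pipe_names=["tok2vec", "ner"],
    )
    print(summarize_pipeline(stand_in))
```

Printing this summary in the notebook and in the .py file, right after `spacy.load(model)`, would show whether the name, version, or component list differs between the two runs.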

This is quite strange; it's not immediately clear to me what the problem could be. My best guess is that somehow different numeric routines are being called and this results in incorrect calculations.

Can you post the pip list, Python version, and CPU type in both situations?
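The requested details can be gathered the same way in every environment with a short standard-library script. This is only a sketch: `environment_report` and its default package names are illustrative choices, not something from spaCy.

```python
import importlib.metadata as md
import platform
import sys


def environment_report(packages=("spacy", "thinc", "numpy", "blis")):
    """Return interpreter, CPU and package-version facts for this environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = None  # not installed for this interpreter
    return {
        "executable": sys.executable,         # which Python is actually running
        "python": platform.python_version(),
        "machine": platform.machine(),        # CPU architecture string
        "packages": versions,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it once in a notebook cell and once from the terminal gives two reports that can be compared line by line.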

To bring some sanity to this, there's definitely nothing about jupyter specifically that could be causing this. It's more about just the stack being installed in a different environment, and this somehow resulting in different calculations.

Thank you for your help

To keep it simple, I am going to start by comparing the spacy info output in the Jupyter notebook and in Poetry on my Mac (but as I said, Jupyter on Windows worked and Poetry on Windows did not).

Using Jupyter: [spacy info output]

Using Poetry: [spacy info output]

An immediate difference I see via spacy info is that Poetry is not accessing a pipeline. Any thoughts on how to fix that?

(This may not be the complete answer, because my file, run outside Poetry and seemingly with the same pip list as Jupyter, also gets bad results.)

By the CPU, do you mean the chip? On the Mac it is an Apple M1.

pip list coming next

pip list from Jupyter:

affine 2.3.1
aiofiles 23.2.1
annotated-types 0.6.0
anyio 3.6.2
appdirs 1.4.4
appnope 0.1.3
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
astor 0.8.1
asttokens 2.1.0
attrs 22.1.0
backcall 0.2.0
backports.functools-lru-cache 1.6.4
beautifulsoup4 4.11.1
bleach 5.0.1
blis 0.7.11
bokeh 3.3.1
branca 0.5.0
Brotli 1.1.0
brotlipy 0.7.0
cachetools 5.3.3
cairocffi 1.6.1
CairoSVG 2.7.1
Cartopy 0.21.0
catalogue 2.0.10
certifi 2022.9.24
cffi 1.15.0
chardet 5.2.0
charset-normalizer 2.0.4
click 8.1.3
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.16.0
colorama 0.4.6
commonmark 0.9.1
conda 22.9.0
conda-content-trust 0+unknown
conda-package-handling 1.8.1
confection 0.1.3
contextily 1.2.0
contourpy 1.0.6
cryptography 37.0.1
cssselect2 0.7.0
cycler 0.11.0
cymem 2.0.8
debugpy 1.6.3
decorator 5.1.1
defusedxml 0.7.1
drawsvg 2.3.0
en-core-web-sm 3.7.0
entrypoints 0.4
et-xmlfile 1.1.0
executing 1.2.0
fastapi 0.102.0
fastjsonschema 2.16.2
Fiona 1.8.22
flit_core 3.7.1
folium 0.13.0
fonttools 4.38.0
GDAL 3.5.3
geographiclib 2.0
geopandas 0.12.1
geoplot 0.5.1
geopy 2.3.0
h11 0.14.0
html5lib 1.1
idna 3.3
importlib-metadata 5.0.0
importlib-resources 5.10.0
iopath 0.1.10
ipykernel 6.17.0
ipython 8.6.0
ipython-genutils 0.2.0
jedi 0.18.1
Jinja2 3.1.2
joblib 1.2.0
jsonschema 4.17.0
jupyter_client 7.4.4
jupyter_core 4.11.2
jupyter-server 1.21.0
jupyterlab-pygments 0.2.2
kiwisolver 1.4.4
langcodes 3.3.0
layoutparser 0.3.4
mapclassify 2.4.3
MarkupSafe 2.1.1
matplotlib 3.6.2
matplotlib-inline 0.1.6
mercantile 1.2.1
mistune 2.0.4
munch 2.5.0
munkres 1.1.4
murmurhash 1.0.10
nbclassic 0.4.8
nbclient 0.7.0
nbconvert 7.2.3
nbformat 5.7.0
nest-asyncio 1.5.6
networkx 2.8.8
notebook 6.5.2
notebook_shim 0.2.2
numpy 1.23.4
openpyxl 3.0.10
osmnx 1.2.2
packaging 21.3
pandas 1.5.1
pandasgui 0.2.14
pandocfilters 1.5.0
parso 0.8.3
pdf2image 1.17.0
pdfminer.six 20231228
pdfplumber 0.11.0
peewee 3.16.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.2.0
pip 24.0
pkgutil_resolve_name 1.3.10
plotly 5.18.0
portalocker 2.8.2
preshed 3.0.9
prodigy 1.15.2
prometheus-client 0.15.0
prompt-toolkit 3.0.31
psutil 5.9.3
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 14.0.1
pycosat 0.6.3
pycparser 2.21
pydantic 2.4.2
pydantic_core 2.10.1
pydyf 0.8.0
Pygments 2.13.0
PyJWT 2.8.0
pynput 1.7.6
pyobjc-core 10.0
pyobjc-framework-ApplicationServices 10.0
pyobjc-framework-Cocoa 10.0
pyobjc-framework-Quartz 10.0
pyogrio 0.4.2
pyOpenSSL 22.0.0
pyparsing 3.0.9
pypdfium2 4.28.0
pyphen 0.14.0
pyproj 3.4.0
PyQt5 5.15.10
PyQt5-Qt5 5.15.11
PyQt5-sip 12.13.0
PyQtWebEngine 5.15.6
PyQtWebEngine-Qt5 5.15.11
pyrsistent 0.19.1
pyshp 2.3.1
PySocks 1.7.1
python-ags4 0.4.1
python-dateutil 2.8.2
python-dotenv 1.0.1
pytz 2022.6
PyYAML 6.0.1
pyzmq 24.0.1
qtstylish 0.1.5
radicli 0.0.25
rasterio 1.3.4
rasterstats 0.17.0
reportlab 4.0.7
requests 2.28.1
rich 10.16.2
Rtree 1.0.1
ruamel-yaml-conda 0.15.100
scikit-learn 1.1.3
scipy 1.9.3
seaborn 0.12.1
Send2Trash 1.8.0
setuptools 61.2.0
Shapely 1.8.5.post1
simplejson 3.18.0
six 1.16.0
smart-open 6.4.0
sniffio 1.3.0
snuggs 1.4.7
soupsieve 2.3.2.post1
spacy 3.7.2
spacy-legacy 3.0.12
spacy-llm 0.7.1
spacy-loggers 1.0.5
spacypdfreader 0.3.1
srsly 2.4.8
stack-data 0.6.0
starlette 0.27.0
striplog 0.9.2
svgwrite 1.4.3
tenacity 8.2.3
terminado 0.17.0
thinc 8.2.1
threadpoolctl 3.1.0
tinycss2 1.2.1
toolz 0.12.0
tornado 6.2
tqdm 4.64.0
traitlets 5.5.0
typeguard 3.0.2
typer 0.9.0
typing_extensions 4.8.0
unicodedata2 15.0.0
urllib3 1.26.9
uvicorn 0.26.0
wasabi 1.1.2
wcwidth 0.2.5
weasel 0.3.4
weasyprint 60.1
webencodings 0.5.1
websocket-client 1.4.1
wheel 0.37.1
wordcloud 1.9.2
xyzservices 2022.9.0
yellowbrick 1.5
zipp 3.10.0
zopfli 0.2.3

Many, many thanks for your help. It's so weird to have a model that works but be unable to replicate its outputs outside of Jupyter. Using Jupyter on 3 different machines (2 Mac, 1 Windows), the model gives very consistent outputs. In Poetry (and also when calling my file outside of Poetry from the terminal) I get outputs, but they are vastly inferior.

pip list from Poetry:

accelerate 0.31.0
aiofiles 24.1.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.4.0
async-timeout 4.0.3
attrs 23.2.0
blis 0.7.11
cachetools 5.3.3
catalogue 2.0.10
certifi 2024.6.2
charset-normalizer 3.3.2
click 8.1.7
cloudpathlib 0.18.1
confection 0.1.5
cymem 2.0.8
datasets 2.20.0
dill 0.3.8
exceptiongroup 1.2.1
fastapi 0.102.0
filelock 3.15.4
frozenlist 1.4.1
fsspec 2024.5.0
h11 0.14.0
huggingface-hub 0.23.4
idna 3.7
importlib_metadata 8.0.0
Jinja2 3.1.4
joblib 1.4.2
langcodes 3.4.0
language_data 1.2.0
marisa-trie 1.2.0
MarkupSafe 2.1.5
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
murmurhash 1.0.10
networkx 3.2.1
numpy 1.26.4
packaging 24.1
pandas 2.2.2
pathlib_abc 0.1.1
pathy 0.11.0
peewee 3.16.3
Pillow 9.4.0
pip 24.0
preshed 3.0.9
prodigy 1.15.2
prodigy_pdf 0.2.1
psutil 6.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
PyJWT 2.8.0
PyMuPDF 1.24.5
PyMuPDFb 1.24.3
pypdfium2 4.20.0
pytesseract 0.3.10
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
pytz 2024.1
PyYAML 6.0.1
radicli 0.0.25
regex 2024.5.15
requests 2.32.3
safetensors 0.4.3
scikit-learn 1.5.0
scipy 1.13.1
seqeval 1.2.2
setuptools 70.1.0
six 1.16.0
smart-open 6.4.0
sniffio 1.3.1
spacy 3.6.1
spacy-legacy 3.0.12
spacy-llm 0.7.2
spacy-loggers 1.0.5
srsly 2.4.8
starlette 0.27.0
sympy 1.12.1
thinc 8.1.12
threadpoolctl 3.5.0
tokenizers 0.19.1
toolz 0.12.1
torch 2.3.1
tqdm 4.66.4
transformers 4.41.2
typeguard 3.0.2
typer 0.9.4
typing_extensions 4.12.2
tzdata 2024.1
urllib3 2.2.2
uvicorn 0.26.0
wasabi 1.1.3
xxhash 3.4.1
yarl 1.9.4
zipp 3.19.2
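Two listings this long are hard to compare by eye. A small sketch of a comparison helper follows; `parse_pip_list` and `diff_environments` are hypothetical names, the name normalization is deliberately rough, and the sample lines are copied from the two listings above.

```python
def parse_pip_list(text):
    """Parse 'name version' lines into a {normalized_name: version} dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything that is not 'name version'
        name, version = parts
        if name == "Package" or set(name) == {"-"}:
            continue  # skip the header and separator rows that pip prints
        packages[name.lower().replace("_", "-")] = version
    return packages


def diff_environments(left, right):
    """Return (only_left, only_right, changed) for two parsed listings."""
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    changed = {
        name: (left[name], right[name])
        for name in sorted(set(left) & set(right))
        if left[name] != right[name]
    }
    return only_left, only_right, changed


if __name__ == "__main__":
    # A few lines copied from the Jupyter and Poetry listings in this thread.
    jupyter = parse_pip_list("en-core-web-sm 3.7.0\nspacy 3.7.2\nthinc 8.2.1\nblis 0.7.11")
    poetry = parse_pip_list("spacy 3.6.1\nthinc 8.1.12\nblis 0.7.11\ntorch 2.3.1")
    only_jupyter, only_poetry, changed = diff_environments(jupyter, poetry)
    print("only in Jupyter:", only_jupyter)
    print("only in Poetry:", only_poetry)
    print("version differs:", changed)
```

On these sample lines the helper reports en-core-web-sm as present only in the Jupyter listing, and spacy and thinc as installed at different versions.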

Okay yeah if the poetry installation isn't accessing a pipeline it wouldn't be predicting anything. So you'll need to get the pipelines installed in your poetry environment.

I know there's a lot of different workflows for Python packaging and environment management, and none are really ideal. But here's my opinionated take on it.

I would recommend keeping the environment management and tooling as bare-bones and simple as possible, which is why personally I don't recommend Poetry. Some of this might be well known to you, but at the heart of it all when you type python or pip what's happening is your PATH variable is consulted, which has a list of directories in order. It goes down the list and finds a directory that has that executable name, and that's what's executed. There can be some additional complications with caching.
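The PATH lookup described here can be reproduced in a few lines of Python. This is a simplified sketch (`resolve_on_path` is a hypothetical helper, and it ignores Windows executable extensions and shell caching), with `shutil.which` shown alongside as the standard-library equivalent of the first hit.

```python
import os
import shutil


def resolve_on_path(command, path=None):
    """Return every PATH entry that holds `command`, in lookup order."""
    path = os.environ.get("PATH", "") if path is None else path
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, command)
        # The first executable match is the one that gets run.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for name in ("python", "pip", "spacy"):
        print(name, "->", shutil.which(name))
        print("   all candidates:", resolve_on_path(name))
```

More than one candidate for the same name means several installations are competing, and the order of the PATH entries decides which one wins.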

The fewer competing things you have messing with your PATH the better in my opinion, because you end up with less surprising results. Tools like Poetry give you some convenience by giving you a virtual environment that's sort of invisible to you, e.g. the active virtual environment depends on which directory you cd into. However, you then have this extra complexity and indirection.

What I like to do is have an explicit directory called something like env3 within the project directory. I create this with a command like python3.10 -m venv env3, and then I do something like source env3/bin/activate and python -m pip install -r requirements.txt. If the state gets messed up (like I'm confused about what should be installed), I simply delete this directory and recreate it. At all times I know I can do something like ls env3/lib/python3.10/site-packages and I can see what's there, sometimes even adding debug prints to third-party libraries if I'm trying to track something down.
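The delete-and-recreate step can also be scripted with the standard-library `venv` module. A minimal sketch, assuming the `env3` directory name from the paragraph above; `recreate_env` is a hypothetical helper, and unlike `python3.10 -m venv env3` it builds the environment for whichever Python version runs the script.

```python
import venv
from pathlib import Path


def recreate_env(env_dir="env3", with_pip=True):
    """Build a fresh virtual environment, wiping any previous contents."""
    env_path = Path(env_dir)
    # clear=True empties an existing directory first, which mirrors
    # "delete this directory and recreate it".
    venv.create(env_path, clear=True, with_pip=with_pip)
    return env_path


# Example (not run automatically, because it writes to the current directory):
#     recreate_env("env3")
```

Activating the environment and installing requirements still happen in the shell afterwards, as described above.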

All that said, migrating workflows can definitely be a hassle, and if your team is using Poetry I'm definitely not saying you need to switch. I just wanted to give some background as I understand environment management in Python can be quite painful, with a lot of conflicting advice.

Underneath it all, if you can run something like spacy download en_core_web_sm that ought to install the pipeline package for you. You'll just need to make sure the spacy you're executing will actually be the one within the right Poetry environment. That can get tricky: it can depend on how Poetry has manipulated your PATH, and how that interacts with your shell's caching. You can run which spacy to make sure the right thing is happening. Running python -m spacy instead of spacy can sometimes help as well.
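One way to take PATH out of the picture is to invoke the spaCy CLI through the interpreter that is already running. The sketch below assumes it is executed by the environment's own Python (for example via `poetry run python`); `spacy_command` and `install_pipeline` are hypothetical helper names.

```python
import subprocess
import sys


def spacy_command(*args):
    """Build a spaCy CLI call bound to the interpreter running this script."""
    # Same as typing `python -m spacy ...`, but with no PATH lookup involved.
    return [sys.executable, "-m", "spacy", *args]


def install_pipeline(name="en_core_web_sm"):
    """Run `spacy download` inside this interpreter's environment."""
    subprocess.run(spacy_command("download", name), check=True)


# Example (downloads a package, so it is not run automatically):
#     install_pipeline("en_core_web_sm")
```

Because the command starts with `sys.executable`, the pipeline package is installed for the same interpreter that will later call `spacy.load`.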

The state you're trying to achieve is the en_core_web_sm package ending up in that site-packages directory of your environment. You can also load spaCy model packages from disk, instead of as a pip-installed package, but it shouldn't be so difficult to have it as a Python package, so let's go with that.
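Whether that state has been reached can be checked from inside the environment. A small sketch using only the standard library; `pipeline_location` is a hypothetical helper, and it assumes the package is installed as a plain directory under the interpreter's site-packages.

```python
import sysconfig
from pathlib import Path


def pipeline_location(package="en_core_web_sm", site_packages=None):
    """Return the package directory inside site-packages, or None if absent."""
    if site_packages is None:
        # "purelib" is the site-packages directory of the running interpreter.
        site_packages = sysconfig.get_paths()["purelib"]
    candidate = Path(site_packages) / package
    return candidate if candidate.is_dir() else None


if __name__ == "__main__":
    found = pipeline_location()
    print(found or "en_core_web_sm is not installed in this environment")
```

Run with the environment's own Python, a `None` result would be consistent with spacy info showing no pipeline.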

Thank you. Putting it in a simple venv as you suggested enabled us to work through the spaCy pipeline issues, and now it all works fine in Poetry too. Many thanks for your input.
