I’ve been training a custom text-classification model and now want to package it as a web service via Docker. To simplify handling of the model, I packaged it using python -m spacy package ... and then followed the directions to run setup.py to create a tar.gz file.
When I install the wheel file (either locally or in Docker) and then load the model, the document returned by nlp does not contain any text categories. The model is named en_webcrawl_news_categories and my code looks like this:
import spacy
import en_webcrawl_news_categories

nlp = en_webcrawl_news_categories.load()

with open('/tmp/article.txt') as f:
    doc = nlp(f.read())

doc.cats  # returns an empty dict
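For reference, this is the kind of check I can run against the installed package to see which pipeline components it actually exposes (just a diagnostic sketch):

import en_webcrawl_news_categories

nlp = en_webcrawl_news_categories.load()

# If 'textcat' is missing from this list, the packaged model itself lacks the component.
print(nlp.pipe_names)

# The meta.json bundled with the package also lists the pipeline components.
print(nlp.meta.get("pipeline"))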
If I load the model in its unarchived format via the local filesystem, the text categorization works (a sketch of that is just below). So I went digging around in the file system where the trained model lives.
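For comparison, loading the unarchived model looks roughly like this (the path is the trained model directory from the listings that follow; adjust as needed):

import spacy

# Load the trained model directly from its directory on disk.
nlp = spacy.load("./models/news_classification2")

with open('/tmp/article.txt') as f:
    doc = nlp(f.read())

# In this case the categories come back populated.
print(doc.cats)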
$> ls -l models
drwxr-xr-x 11 avollmer 1788769119 352 Apr 18 18:53 news_classification1/
drwxr-xr-x 12 avollmer 1788769119 384 Apr 24 11:00 news_classification2/
$> cd news_classification2
$> ls -l
total 104880
drwxr-xr-x 8 avollmer 1788769119 256 Apr 24 11:01 en_webcrawl_news_categories-0.9.0/
-rw-r--r-- 1 avollmer 1788769119 19248727 Apr 22 13:42 evaluation.jsonl
-rw-r--r-- 1 avollmer 1788769119 983 Apr 22 13:42 meta.json
drwxr-xr-x 7 avollmer 1788769119 224 Apr 22 13:42 ner/
drwxr-xr-x 7 avollmer 1788769119 224 Apr 22 13:42 parser/
drwxr-xr-x 5 avollmer 1788769119 160 Apr 22 13:42 tagger/
drwxr-xr-x 4 avollmer 1788769119 128 Apr 22 13:42 textcat/
-rw-r--r-- 1 avollmer 1788769119 57475 Apr 22 13:42 tokenizer
-rw-r--r-- 1 avollmer 1788769119 76727077 Apr 22 13:42 training.jsonl
drwxr-xr-x 6 avollmer 1788769119 192 Apr 22 13:42 vocab/
$> cd en_webcrawl_news_categories-0.9.0
$> ls -l
total 12
-rw-r--r-- 1 avollmer 1788769119 17 Apr 24 11:00 MANIFEST.in
drwxr-xr-x 3 avollmer 1788769119 96 Apr 24 11:01 dist/
drwxr-xr-x 5 avollmer 1788769119 160 Apr 24 11:01 en_webcrawl_news_categories/
drwxr-xr-x 8 avollmer 1788769119 256 Apr 24 11:01 en_webcrawl_news_categories.egg-info/
-rw-r--r-- 1 avollmer 1788769119 883 Apr 24 11:00 meta.json
-rw-r--r-- 1 avollmer 1788769119 1684 Apr 24 11:00 setup.py
$> ls -l en_webcrawl_news_categories
total 8
-rw-r--r-- 1 avollmer 1788769119 291 Apr 24 11:00 __init__.py
drwxr-xr-x 10 avollmer 1788769119 320 Mar 28 09:17 en_webcrawl_news_categories-0.9.0/
-rw-r--r-- 1 avollmer 1788769119 883 Apr 24 11:01 meta.json
$> ls -l en_webcrawl_news_categories/en_webcrawl_news_categories-0.9.0
total 2244
-rw-r--r-- 1 avollmer 1788769119 1104358 Mar 28 09:17 evaluation.jsonl
-rw-r--r-- 1 avollmer 1788769119 883 Apr 24 11:01 meta.json
drwxr-xr-x 7 avollmer 1788769119 224 Mar 28 09:17 ner/
drwxr-xr-x 7 avollmer 1788769119 224 Mar 28 09:17 parser/
drwxr-xr-x 5 avollmer 1788769119 160 Mar 28 09:17 tagger/
-rw-r--r-- 1 avollmer 1788769119 57475 Mar 28 09:17 tokenizer
-rw-r--r-- 1 avollmer 1788769119 1126164 Mar 28 09:17 training.jsonl
drwxr-xr-x 6 avollmer 1788769119 192 Mar 28 09:17 vocab/
Note that the listing of the trained model directory (news_classification2) above contains a textcat directory, but the inner packaged folder (the last directory listing above) does not.
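To double-check whether any textcat files made it into the built archive at all, the sdist contents can be listed like this (the archive name is my guess based on the version; adjust to whatever setup.py actually produced under dist/):

import tarfile

# Hypothetical archive name; adjust to the actual file under dist/.
archive = "dist/en_webcrawl_news_categories-0.9.0.tar.gz"

with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()

# List any files under a textcat/ directory that were included in the package.
print([name for name in members if "/textcat/" in name])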
From the project directory (in which models is a sub-directory) I ran the following command to package the model:
python -m spacy package model ./models/news_classification2 --create-meta
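To rule out a metadata problem, the pipeline list in the trained model's meta.json can also be compared with the copy inside the generated package (a small sketch using the paths from the listings above):

import json

# meta.json of the trained model vs. the copy inside the generated package directory.
paths = [
    "./models/news_classification2/meta.json",
    "./models/news_classification2/en_webcrawl_news_categories-0.9.0/"
    "en_webcrawl_news_categories/en_webcrawl_news_categories-0.9.0/meta.json",
]

for path in paths:
    with open(path) as f:
        meta = json.load(f)
    # 'pipeline' should include 'textcat' if the component is expected to ship.
    print(path, meta.get("pipeline"))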
My spaCy installation is:
$> python -m spacy info

Info about spaCy

spaCy version      2.0.18
Location           /Users/avollmer/Development/spacy-ner/.venv/lib/python3.7/site-packages/spacy
Platform           Darwin-17.7.0-x86_64-i386-64bit
Python version     3.7.2
Models             en_core_web_md, en_core_web_lg, en_core_web_sm
Any ideas on how to get the textcat portion of the model into the dist package?