Text classification packaging issues

I’ve been training a custom text-classification model and now want to deploy it as a web service via Docker. To simplify handling the model, I packaged it using python -m spacy package ... and then followed the directions to run setup.py to create a tar.gz file.

When I install the wheel file (either locally or in Docker) and then load the model, the document returned by the nlp object does not contain any text categories. The model is named en_webcrawl_news_categories and my code looks like this:

import spacy
import en_webcrawl_news_categories

nlp = en_webcrawl_news_categories.load()
doc = nlp(open('/tmp/article.txt').read())
doc.cats  # returns an empty dict

If I load the model in its unarchived form directly from the local filesystem, the text categorization works. So I went digging around in the file system where the trained model lives.

$> ls -l models
drwxr-xr-x 11 avollmer 1788769119 352 Apr 18 18:53 news_classification1/
drwxr-xr-x 12 avollmer 1788769119 384 Apr 24 11:00 news_classification2/

$> cd news_classification2
$> ls -l
total 104880
drwxr-xr-x 8 avollmer 1788769119      256 Apr 24 11:01 en_webcrawl_news_categories-0.9.0/
-rw-r--r-- 1 avollmer 1788769119 19248727 Apr 22 13:42 evaluation.jsonl
-rw-r--r-- 1 avollmer 1788769119      983 Apr 22 13:42 meta.json
drwxr-xr-x 7 avollmer 1788769119      224 Apr 22 13:42 ner/
drwxr-xr-x 7 avollmer 1788769119      224 Apr 22 13:42 parser/
drwxr-xr-x 5 avollmer 1788769119      160 Apr 22 13:42 tagger/
drwxr-xr-x 4 avollmer 1788769119      128 Apr 22 13:42 textcat/
-rw-r--r-- 1 avollmer 1788769119    57475 Apr 22 13:42 tokenizer
-rw-r--r-- 1 avollmer 1788769119 76727077 Apr 22 13:42 training.jsonl
drwxr-xr-x 6 avollmer 1788769119      192 Apr 22 13:42 vocab/

$> cd en_webcrawl_news_categories-0.9.0
$> ls -l
total 12
-rw-r--r-- 1 avollmer 1788769119   17 Apr 24 11:00 MANIFEST.in
drwxr-xr-x 3 avollmer 1788769119   96 Apr 24 11:01 dist/
drwxr-xr-x 5 avollmer 1788769119  160 Apr 24 11:01 en_webcrawl_news_categories/
drwxr-xr-x 8 avollmer 1788769119  256 Apr 24 11:01 en_webcrawl_news_categories.egg-info/
-rw-r--r-- 1 avollmer 1788769119  883 Apr 24 11:00 meta.json
-rw-r--r-- 1 avollmer 1788769119 1684 Apr 24 11:00 setup.py

$> ls -l en_webcrawl_news_categories
total 8
-rw-r--r--  1 avollmer 1788769119 291 Apr 24 11:00 __init__.py
drwxr-xr-x 10 avollmer 1788769119 320 Mar 28 09:17 en_webcrawl_news_categories-0.9.0/
-rw-r--r--  1 avollmer 1788769119 883 Apr 24 11:01 meta.json

$> ls -l en_webcrawl_news_categories/en_webcrawl_news_categories-0.9.0
total 2244
-rw-r--r-- 1 avollmer 1788769119 1104358 Mar 28 09:17 evaluation.jsonl
-rw-r--r-- 1 avollmer 1788769119     883 Apr 24 11:01 meta.json
drwxr-xr-x 7 avollmer 1788769119     224 Mar 28 09:17 ner/
drwxr-xr-x 7 avollmer 1788769119     224 Mar 28 09:17 parser/
drwxr-xr-x 5 avollmer 1788769119     160 Mar 28 09:17 tagger/
-rw-r--r-- 1 avollmer 1788769119   57475 Mar 28 09:17 tokenizer
-rw-r--r-- 1 avollmer 1788769119 1126164 Mar 28 09:17 training.jsonl
drwxr-xr-x 6 avollmer 1788769119     192 Mar 28 09:17 vocab/

Note that in the first directory listing I see a textcat directory, but not in the inner packaged folder (the last directory listing above).

From the project directory (in which models is a sub-directory) I ran the following command to package the model:

python -m spacy package model ./models/news_classification2 --create-meta

My spaCy installation is:

python -m spacy info

    Info about spaCy

    spaCy version      2.0.18
    Location           /Users/avollmer/Development/spacy-ner/.venv/lib/python3.7/site-packages/spacy
    Platform           Darwin-17.7.0-x86_64-i386-64bit
    Python version     3.7.2
    Models             en_core_web_md, en_core_web_lg, en_core_web_sm

Any ideas on how to get the textcat portion of the model into the dist package?

Hi! Your workflow looks good and it does seem like the issue here is that your packaged model doesn’t contain the textcat component. What does your meta.json look like?

I also noticed that when you ran spacy package, you set the --create-meta flag. Did you specify the model pipeline correctly there? Maybe you accidentally hit enter and accepted the default pipeline?
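If it helps to check this mechanically, here’s a minimal sketch that verifies a meta.json actually lists the textcat component – the path in the usage comment is an assumption based on your directory listings:

```python
import json

def pipeline_has_textcat(meta_path):
    """Return True if the meta.json at meta_path lists 'textcat' in its pipeline."""
    with open(meta_path) as f:
        meta = json.load(f)
    return 'textcat' in meta.get('pipeline', [])

# e.g. pipeline_has_textcat('./models/news_classification2/meta.json')
```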

The meta.json in the model’s directory looks like this:

{
  "lang":"en",
  "pipeline":[
    "tagger",
    "parser",
    "ner",
    "textcat"
  ],
  "accuracy":{
    "token_acc":99.8609203297,
    "ents_p":85.6222222222,
    "ents_r":86.1342424412,
    "uas":91.7226238412,
    "tags_acc":97.1103903351,
    "ents_f":85.8774691444,
    "las":89.838878402
  },
  "name":"core_web_md",
  "license":"CC BY-SA 3.0",
  "author":"Explosion AI",
  "url":"https://explosion.ai",
  "vectors":{
    "width":300,
    "vectors":20000,
    "keys":684830,
    "name":"en_core_web_md.vectors"
  },
  "sources":[
    "OntoNotes 5",
    "Common Crawl"
  ],
  "version":"2.0.0",
  "spacy_version":">=2.0.0",
  "parent_package":"spacy",
  "speed":{
    "gpu":null,
    "nwords":291344,
    "cpu":4532.930431222
  },
  "email":"contact@explosion.ai",
  "description":"English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities."
}

The meta.json in the package sub-directory looks like this:

{
  "lang":"en",
  "pipeline":[
    "sbd",
    "tagger",
    "parser",
    "ner"
  ],
  "accuracy":{
    "token_acc":99.8698372794,
    "ents_p":84.9664503965,
    "ents_r":85.6312524451,
    "uas":91.7237657538,
    "tags_acc":97.0403350292,
    "ents_f":85.2975560875,
    "las":89.800872413
  },
  "name":"webcrawl_news_categories",
  "license":"NONE",
  "author":"Alex Vollmer",
  "url":"https://www.summitpartners.com",
  "vectors":{
    "width":0,
    "vectors":0,
    "keys":0
  },
  "sources":[
    "OntoNotes 5",
    "Common Crawl"
  ],
  "version":"0.9.0",
  "spacy_version":">=2.0.18,<3.0.0",
  "parent_package":"spacy",
  "speed":{
    "gpu":null,
    "nwords":291344,
    "cpu":5122.3040471407
  },
  "email":"avollmer@summitpartners.com",
  "description":"Top-level news classification into \"newspaper sections\""
}

I ran spacy package with the --create-meta flag because otherwise I just got a model named en_core_web_md-2.0.0, which is the model my custom model was trained from. Maybe I’m making some other mistake earlier in the workflow?
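For what it’s worth, the two meta.json files above can be diffed mechanically. A small sketch (the file paths passed in are whatever your two copies are – treat any concrete paths as assumptions) that reports pipeline components present in the source model’s meta but missing from the packaged copy:

```python
import json

def missing_pipeline_components(source_meta, package_meta):
    """Components listed in the source model's meta.json but absent
    from the packaged copy's meta.json."""
    with open(source_meta) as f:
        src = json.load(f)
    with open(package_meta) as f:
        pkg = json.load(f)
    return [c for c in src.get('pipeline', []) if c not in pkg.get('pipeline', [])]
```

Run against the two files posted above, this would report ['textcat'] as the dropped component.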

Thanks for the details. I just double-checked and when you run the command with --create-meta, it should automatically read the pipeline information off the model you’re packaging.

So what happens when you run the following and load the spaCy model from a path? Is the pipeline of that model correct?

nlp = spacy.load('./models/news_classification2')
print(nlp.pipe_names)

Thanks for the response. When I fire up ipython and run the code you suggested, I get this:

[ins] In [1]: import spacy

[ins] In [2]: nlp = spacy.load('./models/news_classification2')

[ins] In [3]: print(nlp.pipe_names)
['tagger', 'parser', 'ner', 'textcat']

I just looked at the file system overview you posted again and… I think you might have actually ended up in a slightly confusing state. Your news_classification2 directory does have all the model components – but it also has this packaged en_webcrawl_news_categories-0.9.0 with another nested model, which shouldn’t really be there.

Could you try starting off fresh with just the following files and directories? (You could remove the training and evaluation files if you want – that’s kinda up to you.)

-rw-r--r-- 1 avollmer 1788769119 19248727 Apr 22 13:42 evaluation.jsonl
-rw-r--r-- 1 avollmer 1788769119      983 Apr 22 13:42 meta.json
drwxr-xr-x 7 avollmer 1788769119      224 Apr 22 13:42 ner/
drwxr-xr-x 7 avollmer 1788769119      224 Apr 22 13:42 parser/
drwxr-xr-x 5 avollmer 1788769119      160 Apr 22 13:42 tagger/
drwxr-xr-x 4 avollmer 1788769119      128 Apr 22 13:42 textcat/
-rw-r--r-- 1 avollmer 1788769119    57475 Apr 22 13:42 tokenizer
-rw-r--r-- 1 avollmer 1788769119 76727077 Apr 22 13:42 training.jsonl
drwxr-xr-x 6 avollmer 1788769119      192 Apr 22 13:42 vocab/

Next, just to be extra safe, edit the meta.json and make sure it lists the correct pipeline (including textcat). Then edit it to reflect your desired name and other meta settings. Finally, run spacy package on that directory again, make sure to save the result to a separate output location, and then package the sdist.
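A small helper along these lines could do the cleanup step – the nested directory name is taken from your listings above, so treat it as an assumption:

```python
import os
import shutil

def remove_stray_package(model_dir, package_name='en_webcrawl_news_categories-0.9.0'):
    """Delete a packaged-model directory that was accidentally nested inside
    the model directory, so spacy package starts from a clean tree.
    Returns True if something was removed."""
    stray = os.path.join(model_dir, package_name)
    if os.path.isdir(stray):
        shutil.rmtree(stray)
        return True
    return False
```

After cleaning up, pointing spacy package at a separate output directory keeps the package from ending up nested inside the model directory again.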

Thanks for your help. That seemed to fix things. I’m not sure how that happened, but things seem to be back on track.
