Outdated Documentation and trouble loading textcat model

Stephan · August 15, 2018, 9:24am

Hey there,

I slightly modified the textcat.batch-train recipe to save the best_model as a byte string with nlp.to_bytes(), as I only want to store the bare minimum of this model in a cloud storage bucket.

But now I struggle reading it back. When I do

import spacy

nlp = spacy.blank('en', pipeline=[])
with open('/Users/dedan/Downloads/my-test-file.spacymodel', 'rb') as f:
    nlp = nlp.from_bytes(f.read())

Then nlp.pipe_names is still empty.

I also tried examples from the TextCategorizer docs, but no success there either.

import spacy
from spacy.pipeline import TextCategorizer

nlp = spacy.blank('en', pipeline=[])
textcat = TextCategorizer(nlp.vocab)
with open('/Users/dedan/Downloads/my-test-file.txt', 'rb') as f:
    textcat.from_bytes(f.read())
textcat('This is a sentence')

This throws TypeError: 'bool' object is not callable, which led me to this issue, but the documentation still seems to be outdated.

I feel like I’m missing something totally obvious, can you please help me with that?

ines · August 15, 2018, 11:48am

Something like this can't work, because a pipeline component always expects a spaCy Doc object, not a string of text. I'd also generally recommend adding components to the pipeline first and calling them via the nlp object instead of directly on the Doc.

If you just want to save out the byte string, you probably also want to save out the nlp.meta as well (a dictionary that's usually saved out as the meta.json). This lets you reconstruct the full model and pipeline:

import spacy

# let's assume `meta` is the meta dict and `byte_string` your data
nlp = spacy.blank(meta['lang'])
for pipe_name in meta['pipeline']:
    pipe = nlp.create_pipe(pipe_name)
    nlp.add_pipe(pipe)
nlp.from_bytes(byte_string)

This is almost exactly what happens behind the scenes when you call spacy.load.
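Since the meta dict and the byte string need to travel together, one option is to bundle them into a single file before uploading. The following is only a minimal sketch using the standard library: `pack_model` and `unpack_model` are hypothetical helper names, not part of spaCy, and the file layout (a 4-byte length header, the JSON-encoded meta, then the raw bytes) is an assumption chosen for illustration. It also assumes the meta dict is JSON-serializable, which should hold given that it's usually saved out as the meta.json.

```python
import json
import struct


def pack_model(meta, byte_string):
    """Bundle a meta dict and a model byte string into one blob.

    Layout (an illustrative assumption, not a spaCy format):
    4-byte big-endian length of the JSON meta, the JSON meta, the raw bytes.
    """
    meta_bytes = json.dumps(meta).encode("utf-8")
    header = struct.pack(">I", len(meta_bytes))
    return header + meta_bytes + byte_string


def unpack_model(blob):
    """Split a blob created by pack_model back into (meta, byte_string)."""
    (meta_len,) = struct.unpack(">I", blob[:4])
    meta = json.loads(blob[4:4 + meta_len].decode("utf-8"))
    byte_string = blob[4 + meta_len:]
    return meta, byte_string


# Stand-in values; with spaCy these would come from nlp.meta and nlp.to_bytes()
meta = {"lang": "en", "pipeline": ["textcat"]}
byte_string = b"\x00\x01 placeholder model bytes"

blob = pack_model(meta, byte_string)
restored_meta, restored_bytes = unpack_model(blob)
```

After unpacking, `restored_meta` and `restored_bytes` can be used as `meta` and `byte_string` in the snippet above to rebuild the pipeline and load the weights.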

Stephan · August 15, 2018, 12:04pm

Hey Ines,

as always: thank you very much for the fast help!

Constructing the pipelines from meta.json works!

Maybe you could document this somewhere? I could not find general information on how to serialize and load models in the documentation.

ines · August 15, 2018, 12:11pm

Sure! Where would you have expected to find this info? Like, which page or section in the docs?

(I think here might be a good place – but don’t click that link before you’ve thought about where you would have expected it, because I’m super interested in that and don’t want to bias your feedback.)

Stephan · August 15, 2018, 12:16pm

Ok, I thought about it before clicking and thought about here: https://spacy.io/api/top-level

Now that I know more than before, I realize that you already describe how spacy.load works there. I must have overlooked that before, or I just did not understand it.

The location where you suggested is great I think.

My main problem was that I did not understand what information is actually contained in that byte string. Is it the whole vocabulary, or only the weights of my textcat model? I did not realize that it contains information about multiple pipes, and that I need the information from meta to reconstruct the structure of the pipeline before I can load the model.

Does that help?
