Outdated Documentation and trouble loading textcat model

Hey there,

I slightly modified the textcat.batch-train recipe to save the best model as a byte string with nlp.to_bytes(), since I only want to store the bare minimum of this model in a cloud storage bucket.

But now I'm struggling to read it back. When I do

import spacy

nlp = spacy.blank('en', pipeline=[])
with open('/Users/dedan/Downloads/my-test-file.spacymodel', 'rb') as f:
    nlp = nlp.from_bytes(f.read())

then nlp.pipe_names is still empty.

I also tried the examples from the TextCategorizer docs, but had no success there either.

import spacy
from spacy.pipeline import TextCategorizer

nlp = spacy.blank('en', pipeline=[])
textcat = TextCategorizer(nlp.vocab)
with open('/Users/dedan/Downloads/my-test-file.txt', 'rb') as f:
    textcat.from_bytes(f.read())
textcat('This is a sentence')

This throws TypeError: 'bool' object is not callable, which led me to this issue, but the documentation still seems to be outdated.

I feel like I’m missing something totally obvious, can you please help me with that?

Calling a component on a string, as in textcat('This is a sentence'), can't work, because a pipeline component always expects a spaCy Doc object, not a string of text. I'd also generally recommend adding components to the pipeline first and calling them via the nlp object instead of directly on the Doc.

If you just want to save out the byte string, you probably also want to save out nlp.meta (a dictionary that's usually saved out as the meta.json). This lets you reconstruct the full model and pipeline:

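import spacy

# let's assume `meta` is the meta dict and `byte_string` your data
nlp = spacy.blank(meta['lang'])
# recreate each (still untrained) component listed in the meta
for pipe_name in meta['pipeline']:
    pipe = nlp.create_pipe(pipe_name)
    nlp.add_pipe(pipe)
# now that the pipeline structure exists, load the saved data into it
nlp.from_bytes(byte_string)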

This is almost exactly what happens behind the scenes when you call spacy.load.
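
Since the goal is to keep a single small artifact in a storage bucket, one option is to bundle the meta dict and the byte string into one blob. The following is only a minimal sketch using the standard library: the container layout (a 4-byte length prefix, the JSON-encoded meta, then the model bytes) and the helper names pack_model and unpack_model are assumptions made for illustration, not a spaCy format or API. The example values stand in for nlp.meta and nlp.to_bytes().

```python
import json
import struct

# Hypothetical container layout (an assumption, not a spaCy format):
# 4-byte big-endian length of the JSON meta, the JSON meta, then the model bytes.


def pack_model(meta, byte_string):
    """Bundle a meta dict and a model byte string into a single blob."""
    meta_json = json.dumps(meta).encode("utf8")
    return struct.pack(">I", len(meta_json)) + meta_json + byte_string


def unpack_model(blob):
    """Split a blob produced by pack_model back into (meta, byte_string)."""
    (meta_len,) = struct.unpack(">I", blob[:4])
    meta = json.loads(blob[4:4 + meta_len].decode("utf8"))
    return meta, blob[4 + meta_len:]


# Stand-in values; in practice these would be nlp.meta and nlp.to_bytes()
example_meta = {"lang": "en", "pipeline": ["textcat"]}
example_bytes = b"\x00\x01model-weights"

blob = pack_model(example_meta, example_bytes)
restored_meta, restored_bytes = unpack_model(blob)
```

The restored meta and byte string can then go through the same reconstruction steps as above: create the blank language class from meta['lang'], add each pipe named in meta['pipeline'], and call nlp.from_bytes on the byte string.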

Hey Ines,

as always: thank you very much for the fast help!

Constructing the pipelines from meta.json works!

Maybe you could document this somewhere; I could not find any general information in the documentation on how to serialize and load models.

Sure! Where would you have expected to find this info? Like, which page or section in the docs?

(I think here might be a good place – but don’t click that link before you’ve thought about where you would have expected it, because I’m super interested in that and don’t want to bias your feedback ;))

OK, I thought about it before clicking, and my guess was here: https://spacy.io/api/top-level

Now that I know more than before, I realize that you already describe there how spacy.load works. I must have overlooked that, or I just did not understand it.

The location you suggested is great, I think.

My main problem was that I did not understand what information is actually contained in that byte string. Is it the whole vocabulary, or only the weights of my textcat model? I did not realize that it contains information about multiple pipes, and that I need the information from meta to reconstruct the structure of the pipeline before I can load the model.

Does that help?
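
To illustrate the point about pipeline structure, here is a toy model of the behaviour discussed in this thread. It is deliberately not spaCy code: ToyPipeline and ToyComponent are hypothetical classes, and the sketch assumes that loading only fills in components that already exist in the pipeline, which is consistent with nlp.pipe_names staying empty after calling from_bytes on a blank pipeline.

```python
class ToyComponent:
    """Hypothetical stand-in for a pipeline component such as textcat."""

    def __init__(self):
        self.weights = None

    def from_bytes(self, data):
        self.weights = data


class ToyPipeline:
    """Hypothetical stand-in for the nlp object."""

    def __init__(self):
        self.components = {}

    def add_pipe(self, name):
        self.components[name] = ToyComponent()

    def from_data(self, data):
        # Only components that already exist are loaded;
        # saved entries without a matching component are skipped.
        for name, component in self.components.items():
            if name in data:
                component.from_bytes(data[name])
        return self


saved = {"textcat": b"weights"}

# Loading into an empty pipeline creates no components
empty = ToyPipeline().from_data(saved)

# Rebuilding the structure first lets the saved data land somewhere
rebuilt = ToyPipeline()
rebuilt.add_pipe("textcat")
rebuilt.from_data(saved)
```

In this sketch, empty ends up with no components at all, while rebuilt has its textcat weights restored, which mirrors why the pipe names from meta are needed before loading.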