How to use a (sentence-targeted) textcat model together with the core model

I created a textcat model following the examples in the Prodigy documentation, which I can now load with nlp = spacy.load('my_model').

The first question is:
What is the best way to integrate that model into the core pipeline?

My current approach is:

nlc = spacy.load('my_model')
tc = nlc.get_pipe('textcat')

nlp = spacy.load('en_core_web_md')
nlp.add_pipe(tc, last=True)

The above seems to work, but reversing the loading order does not:

nlp = spacy.load('en_core_web_md')
nlc = spacy.load('my_model')

tc = nlc.get_pipe('textcat')
nlp.add_pipe(tc, last=True)

doc = nlp(text)

doc = nlp(text)
  File "/usr/local/lib/python3.5/site-packages/spacy/language.py", line 333, in __call__
    doc = proc(doc)
  File "pipeline.pyx", line 390, in spacy.pipeline.Tagger.__call__
  File "pipeline.pyx", line 402, in spacy.pipeline.Tagger.predict
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 55, in predict
    X = layer(X)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 293, in predict
    X = layer(layer.ops.flatten(seqs_in, pad=pad))
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 55, in predict
    X = layer(X)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 372, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X[ind], drop=drop)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/static_vectors.py", line 67, in begin_update
    dotted = self.ops.batch_dot(vectors, self.W)
  File "ops.pyx", line 299, in thinc.neural.ops.NumpyOps.batch_dot
ValueError: shapes (9,0) and (300,128) not aligned: 0 (dim 1) != 300 (dim 0)

(This error only occurs when using the md or lg core models, not with sm.)

I'm also not sure about the 'sbd' component in nlc (the nlc pipe names are ['sbd', 'textcat']).

Now, in another scenario following the same approach, I have created a sentence classifier that I would like to include in the core pipeline, so that I can pass in larger documents, iterate over the sentences in a document, and get a classification score per sentence instead of per document.

My current approach, with nlp as the core model and nlc as 'my_model' (as above, but standalone, without adding one to the other's pipeline), is:

doc = nlp(text)
for s in doc.sents:
    sentence_score = nlc(s.text).cats['my_score']

Is there a better way to do this?

Thank you.

(also: thank you for building awesome tools!)

The intended workflow is to get the pipeline to a state you’re happy with, save it out with nlp.to_disk(), and then run spacy package on that directory to generate the package wrapping code. If you need any setup code, you can put it in the __init__.py of your package – for instance, if you need to add a custom factory, or have some post-loading code that’s awkward to serialise. At that point you should be able to run python setup.py sdist or python setup.py bdist_wheel to get a source distribution or wheel for your target platform.
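
For instance, a minimal sketch of that workflow (the paths and package name here are hypothetical):

# assuming `nlp` is the combined core+textcat pipeline from the first example
nlp.to_disk('/tmp/my_combined_model')

# then, on the command line:
#   python -m spacy package /tmp/my_combined_model /tmp/packages
#   cd /tmp/packages/<lang>_<name>-<version>   # directory name comes from your meta.json
#   python setup.py sdist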

You can then install the trained model as a package. This takes care of model versioning for you as well, because you can attach a version to the package you create, and have your production deployment declare which model versions it’s tested with.
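
Concretely, the install and load steps might look something like this (package name and version are hypothetical):

# pip install /tmp/packages/<lang>_<name>-<version>/dist/my_combined_model-1.0.0.tar.gz
# then pin the version in your deployment, e.g. my_combined_model==1.0.0 in requirements.txt
import spacy

nlp = spacy.load('my_combined_model')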

The weird error you’re seeing from the ordering of the two load calls is unfortunate. Currently the pre-trained vectors register themselves in a global variable within thinc. The global variable is used to avoid serialising the pre-trained vectors within the model, as this would blow up the model size. Obviously this isn’t a great solution, though. I look forward to improving it in future.

Oh OK, so in the first example, after doing add_pipe I can just write the whole pipeline to disk and make that my new core+custom model? Cool.

How about the sentence-level scores? It seems like that needs some custom code somewhere to score each sentence (and maybe compute an average sentence score on the document) – something like the sketch below?
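
For instance, a minimal sketch of the kind of custom component I have in mind, assuming nlc is the standalone sentence model and 'my_score' is the category (all names hypothetical):

from spacy.tokens import Doc, Span

# hypothetical extension attributes for per-sentence and document-level scores
Span.set_extension('my_score', default=None)
Doc.set_extension('avg_my_score', default=None)

def sentence_scorer(doc):
    # run the standalone textcat model over each sentence's text
    scores = []
    for sent in doc.sents:
        score = nlc(sent.text).cats['my_score']
        sent._.my_score = score
        scores.append(score)
    # average the sentence scores onto the document
    if scores:
        doc._.avg_my_score = sum(scores) / len(scores)
    return doc

# add the component last, so the parser's sentence boundaries are available
nlp.add_pipe(sentence_scorer, last=True)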

In terms of the pre-trained vectors, that makes sense. nlp.to_disk() would still serialize them if I wrote out the core+custom model as above, right?