I created a textcat model following the examples from the Prodigy documentation, which I can now load as nlp = spacy.load('my_model').
The first question is:
What is the best way to integrate that model into the core pipeline?
My current approach is:
import spacy

nlc = spacy.load('my_model')
tc = nlc.get_pipe('textcat')
nlp = spacy.load('en_core_web_md')
nlp.add_pipe(tc, last=True)
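As a sanity check, the merged pipeline looks plausible (pipe_names is the standard attribute for inspecting this):

# inspect the merged pipeline: the core components plus the appended textcat
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner', 'textcat']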
The above seems to work, but reversing the loading order does not:
nlp = spacy.load('en_core_web_md')
nlc = spacy.load('my_model')
tc = nlc.get_pipe('textcat')
nlp.add_pipe(tc, last=True)
doc = nlp(text)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/spacy/language.py", line 333, in __call__
    doc = proc(doc)
  File "pipeline.pyx", line 390, in spacy.pipeline.Tagger.__call__
  File "pipeline.pyx", line 402, in spacy.pipeline.Tagger.predict
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 55, in predict
    X = layer(X)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 293, in predict
    X = layer(layer.ops.flatten(seqs_in, pad=pad))
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 55, in predict
    X = layer(X)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 372, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X[ind], drop=drop)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 176, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "/usr/local/lib/python3.5/site-packages/thinc/api.py", line 258, in wrap
    output = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/site-packages/thinc/neural/_classes/static_vectors.py", line 67, in begin_update
    dotted = self.ops.batch_dot(vectors, self.W)
  File "ops.pyx", line 299, in thinc.neural.ops.NumpyOps.batch_dot
ValueError: shapes (9,0) and (300,128) not aligned: 0 (dim 1) != 300 (dim 0)
(This error only occurs with the md or lg core models, not with sm.)
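As a quick diagnostic (a minimal sketch; Vocab.vectors.shape is the standard way to inspect the vectors table), the (9, 0) shape in the error suggests one of the models ends up with an empty vectors table:

# compare the word-vector tables of the two pipelines; a zero second
# dimension on one side would explain the (9, 0) input that fails
# against the (300, 128) weight matrix
print(nlp.vocab.vectors.shape)  # e.g. (20000, 300) for en_core_web_md
print(nlc.vocab.vectors.shape)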
Also, I am not sure about the "sbd" component in nlc (nlc's pipe names are ['sbd', 'textcat']).
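If "sbd" is just a sentence-boundary component that Prodigy added, I assume it could be dropped before grabbing the textcat (a sketch using the standard remove_pipe; I'm not sure this is the right thing to do):

# drop the sentence-boundary component from my_model's pipeline,
# assuming it is not needed once the core model handles parsing
nlc.remove_pipe('sbd')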
Now, in another scenario following the same approach, I have created a sentence classifier that I would like to include in the core pipeline, so that I can pass in larger documents, iterate over the sentences in each document, and get a classification score per sentence instead of per document.
My current approach, with nlp as the core model and nlc as 'my_model' (both loaded standalone, without adding one to the other's pipeline), is:
doc = nlp(text)
for s in doc.sents:
    sentence_score = nlc(s.text).cats['my_score']
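One variant of this, batching the sentences through the classifier instead of calling it once per sentence (assuming nlc.pipe behaves like the usual Language.pipe):

doc = nlp(text)
sents = list(doc.sents)
# classify all sentence texts in one batch rather than one nlc() call each
for sent, sent_doc in zip(sents, nlc.pipe(s.text for s in sents)):
    sentence_score = sent_doc.cats['my_score']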
Is there a better way to do that?
Thank you.
(Also: thank you for building awesome tools!)