Hi, so I think something weird is going on in `morphology.pyx` and `assign_tag_id`.
I trained a tagger with custom labels (`plural` and `singular`).
```
In [69]: nlp.tagger.labels
Out[69]: ('X', '_SP', 'plural', 'singular')
```
For this sentence, the tagger looks OK: it predicts `plural` for the word *micron*.
```
doc = nlp('The median particle size of the milled material was about 3 micron.')

In [74]: nlp.tagger.predict([doc])
Out[74]:
([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0])],

In [88]: [(i.text, i.tag_, i.pos_, i.lemma_) for i in doc]
Out[88]:
[('The', 'X', 'X', 'the'),
 ('median', 'X', 'X', 'median'),
 ('particle', 'X', 'X', 'particle'),
 ('size', 'X', 'X', 'size'),
 ('of', 'X', 'X', 'of'),
 ('the', 'X', 'X', 'the'),
 ('milled', 'X', 'X', 'milled'),
 ('material', 'X', 'X', 'material'),
 ('was', 'X', 'X', 'was'),
 ('about', 'X', 'X', 'about'),
 ('3', 'X', 'X', '3'),
 ('micron', 'plural', 'X', 'micron'),
 ('.', 'X', 'X', '.')]
```
Now for this sentence, even though the tag ids predicted by the model are all 0, the pipeline still produces labels that are not part of the original label set! (see `PRP` and `MD`)
```
In [75]: nlp.tagger.predict([nlp('And for that purpose, I’ll anoint my swords.')])
Out[75]:
([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])],

In [85]: [(i.text, i.tag_, i.pos_, i.lemma_) for i in doc]
Out[85]:
[('And', 'X', 'X', 'and'),
 ('for', 'X', 'X', 'for'),
 ('that', 'X', 'X', 'that'),
 ('purpose', 'X', 'X', 'purpose'),
 (',', 'X', 'X', ','),
 ('I', 'PRP', 'PRON', '-PRON-'),
 ('’ll', 'MD', 'VERB', 'will'),
 ('anoint', 'X', 'X', 'anoint'),
 ('my', 'X', 'X', 'my'),
 ('swords', 'X', 'X', 'swords'),
 ('.', 'X', 'X', '.')]
```
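To make the mismatch concrete, here is the invariant I'd expect, sketched in plain Python (no spaCy; the label tuple, tag-id array, and observed tags are copied from the outputs above):

```python
# Trained label set, as reported by nlp.tagger.labels
labels = ('X', '_SP', 'plural', 'singular')

# Raw tag ids the model predicted for the second sentence (all zeros)
tag_ids = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Decoding ids through the label tuple should only ever yield trained labels
decoded = [labels[i] for i in tag_ids]
print(decoded)  # every token decodes to 'X'

# ...yet the annotated doc shows 'PRP' and 'MD', neither of which was trained
observed = ['X', 'X', 'X', 'X', 'X', 'PRP', 'MD', 'X', 'X', 'X', 'X']
unexpected = [tag for tag in observed if tag not in labels]
print(unexpected)  # → ['PRP', 'MD']
```

So whatever maps the predicted ids onto the doc is not a pure index-into-`labels` lookup; something downstream is substituting tags.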
So `set_annotations` inside `Tagger` changes the tagger predictions. `set_annotations` calls `assign_tag_id` inside `morphology.pyx`, and I think something is going on there… I also notice that the weird predictions happen when the lemma is different from the original word.
Possibly related issue: https://github.com/explosion/spaCy/issues/3268