Is it possible to train a tagger with custom labels?

For NER training, we can provide custom labels completely independent of the original label set. It doesn’t seem to be possible for taggers.

TAG_MAP = {
    'N': {'pos': 'NOUN'},  # change this
    'V': {'pos': 'VERB'},  # change this
    'J': {'pos': 'ADJ'}  # change this
}

More generally, we’d like to train a model that is pure sequence tagging (NER without the transition system, or a tagger with custom labels). Would there be a simple way to do this?

Thanks!

The .pos attribute is typed with an enum, so its values are hard-coded to the Universal Dependencies tags.

Can you just use the .tag attribute, and map all the .pos values to like X or something? It shouldn’t matter — the .pos attribute isn’t used internally as a feature for anything downstream.


Oh nice, so something like this:

TAG_MAP = {
    'COLOR': {'pos': 'X'},
    'X': {'pos': 'X'},
}

and

TRAIN_DATA = [
    ("I like green eggs", {'tags': ['X', 'X', 'COLOR', 'X']}),
    ("Eat blue ham", {'tags': ['X', 'COLOR', 'X']})
]

After training I get:

print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
# Tags [('I', 'X', 'X'), ('like', 'X', 'X'), ('blue', 'COLOR', 'X'), ('eggs', 'X', 'X')]

Hi, so I think something weird is going on in morphology.pyx and assign_tag_id.

I trained a tagger with custom labels (plural and singular).

In [69]: nlp.tagger.labels
Out[69]: ('X', '_SP', 'plural', 'singular')

For this sentence, the tagger looks OK: it’s predicting ‘plural’ for the word “micron”.

doc = nlp('The median particle size of the milled material was about 3 micron.')

In [74]: nlp.tagger.predict([doc])
Out[74]:
([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0])],

In [88]: [(i.text, i.tag_, i.pos_, i.lemma_) for i in doc]
    ...:
Out[88]:
[('The', 'X', 'X', 'the'),
 ('median', 'X', 'X', 'median'),
 ('particle', 'X', 'X', 'particle'),
 ('size', 'X', 'X', 'size'),
 ('of', 'X', 'X', 'of'),
 ('the', 'X', 'X', 'the'),
 ('milled', 'X', 'X', 'milled'),
 ('material', 'X', 'X', 'material'),
 ('was', 'X', 'X', 'was'),
 ('about', 'X', 'X', 'about'),
 ('3', 'X', 'X', '3'),
 ('micron', 'plural', 'X', 'micron'),
 ('.', 'X', 'X', '.')]

Now for this sentence, even though the tag IDs predicted by the model are all 0, the doc still ends up with labels that are not part of the original label set! (see PRP and MD)

In [75]: nlp.tagger.predict([nlp('And for that purpose, I’ll anoint my swords.')])
Out[75]:
([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])],

In [85]: [(i.text, i.tag_, i.pos_, i.lemma_) for i in doc]
Out[85]:
[('And', 'X', 'X', 'and'),
 ('for', 'X', 'X', 'for'),
 ('that', 'X', 'X', 'that'),
 ('purpose', 'X', 'X', 'purpose'),
 (',', 'X', 'X', ','),
 ('I', 'PRP', 'PRON', '-PRON-'),
 ('’ll', 'MD', 'VERB', 'will'),
 ('anoint', 'X', 'X', 'anoint'),
 ('my', 'X', 'X', 'my'),
 ('swords', 'X', 'X', 'swords'),
 ('.', 'X', 'X', '.')]

So set_annotations inside Tagger changes the tagger’s predictions. set_annotations calls assign_tag_id inside morphology.pyx, and I think something is going wrong there… I also notice that the weird predictions happen when the lemma is different from the original word.

Possible related issue: https://github.com/explosion/spaCy/issues/3268

I’m pretty sure the confusing behaviour is coming in from the tokenizer’s exception rules. I made the regrettable decision very early to allow the tokenizer to set token attributes like the tag, lemma etc in the exception rules. I definitely want to change this in v3.

I think something like this should work:


from spacy.symbols import TAG, LEMMA, POS

# Iterate over a copy, since add_special_case writes back into _rules
for chunk, substrings in list(nlp.tokenizer._rules.items()):
    for token in substrings:
        for attr in (TAG, LEMMA, POS):
            token.pop(attr, None)  # drop the attribute if the rule sets it
    nlp.tokenizer.add_special_case(chunk, substrings)

Here we’re overwriting the special-case rule with a new version, where the tags aren’t set.
