Hi, so I think something weird is going on in `morphology.pyx` and `assign_tag_id`.
I trained a tagger with custom labels (`plural` and `singular`).
```
In [69]: nlp.tagger.labels
Out[69]: ('X', '_SP', 'plural', 'singular')
```
For this sentence, the tagger looks OK: it predicts `plural` for the word *micron*.
```
doc = nlp('The median particle size of the milled material was about 3 micron.')

In [74]: nlp.tagger.predict([doc])
Out[74]:
([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0])],

In [88]: [(i.text, i.tag_, i.pos_, i.lemma_) for i in doc]
Out[88]:
[('The', 'X', 'X', 'the'),
 ('median', 'X', 'X', 'median'),
 ('particle', 'X', 'X', 'particle'),
 ('size', 'X', 'X', 'size'),
 ('of', 'X', 'X', 'of'),
 ('the', 'X', 'X', 'the'),
 ('milled', 'X', 'X', 'milled'),
 ('material', 'X', 'X', 'material'),
 ('was', 'X', 'X', 'was'),
 ('about', 'X', 'X', 'about'),
 ('3', 'X', 'X', '3'),
 ('micron', 'plural', 'X', 'micron'),
 ('.', 'X', 'X', '.')]
```
Now for this sentence, even though the tag ids predicted by the model are all 0, the pipeline still produces labels that are not part of the original label set! (see `PRP` and `MD`)
```
In [75]: nlp.tagger.predict([nlp('And for that purpose, I’ll anoint my swords.')])
Out[75]:
([array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])],

In [85]: [(i.text, i.tag_, i.pos_, i.lemma_) for i in doc]
Out[85]:
[('And', 'X', 'X', 'and'),
 ('for', 'X', 'X', 'for'),
 ('that', 'X', 'X', 'that'),
 ('purpose', 'X', 'X', 'purpose'),
 (',', 'X', 'X', ','),
 ('I', 'PRP', 'PRON', '-PRON-'),
 ('’ll', 'MD', 'VERB', 'will'),
 ('anoint', 'X', 'X', 'anoint'),
 ('my', 'X', 'X', 'my'),
 ('swords', 'X', 'X', 'swords'),
 ('.', 'X', 'X', '.')]
```
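To make the mismatch concrete, here is the invariant I'd expect, sketched in plain Python (no spaCy; the label tuple, tag-id array, and observed tags are copied from the outputs above):

```python
# Trained label set, as reported by nlp.tagger.labels
labels = ('X', '_SP', 'plural', 'singular')

# Raw tag ids the model predicted for the second sentence (all zeros)
tag_ids = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Decoding ids through the label tuple should only ever yield trained labels
decoded = [labels[i] for i in tag_ids]
print(decoded)  # every token decodes to 'X'

# ...yet the annotated doc shows 'PRP' and 'MD', neither of which was trained
observed = ['X', 'X', 'X', 'X', 'X', 'PRP', 'MD', 'X', 'X', 'X', 'X']
unexpected = [tag for tag in observed if tag not in labels]
print(unexpected)  # → ['PRP', 'MD']
```

So whatever maps the predicted ids onto the doc is not a pure index-into-`labels` lookup; something downstream is substituting tags.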
So `set_annotations` inside `Tagger` changes the tagger predictions. `set_annotations` calls `assign_tag_id` inside `morphology.pyx`, and I think something is going on there… I also notice that the weird predictions happen when the lemma is different from the original word.
Possibly related issue: https://github.com/explosion/spaCy/issues/3268