We have been developing a lemmatizer for Ancient Greek that worked well with spaCy 2 but does not work in spaCy 3. The lemmatizer is part of an Ancient Greek model we are building for use with Prodigy, and we started porting it to take full advantage of spaCy 3 and Prodigy, especially the dependency parser and relation extraction.
With this lemmatizer (which is POS-based and modeled on the Polish lemmatizer), we can train the spaCy 3 parser but not the tagger. I think the problem has to do with the labels that are extracted from the UD corpus during training.
After running spacy init labels, the labels file looks like this:
These are not the POS tags that the lemmatizer expects. The train and init labels commands read the 5th column of the UD file, not the 4th, as POS tags. We could handle this in spaCy 2 with the tag_map file, but tag_map is gone in spaCy 3.
How do I get the train command to read the POS tags rather than the fine-grained tags during training?
The default solution is to use the morphologizer instead of the tagger for POS tags. With the default setup, the model will learn to predict the values from the UPOS+FEATS columns.
The morphologizer is basically just a tagger that has some extra machinery to handle Feat=Val in the labels. POS is added as POS=X to the FEATS and, after the model runs, it's split off as token.pos instead of being included as a feature in token.morph.
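To make the POS=X mechanism concrete, here is a small pure-Python sketch. It is not spaCy's internal code: the helper names to_label and from_label are made up for illustration, and the alphabetical ordering of features is an assumption of this sketch.

```python
def to_label(upos, feats):
    """Fold UPOS into a FEATS string as POS=X (hypothetical helper)."""
    # "_" is the CoNLL-U placeholder for "no features".
    parts = [] if feats in ("", "_") else feats.split("|")
    parts.append(f"POS={upos}")
    # Sorting is an assumption made here to get one stable label per analysis.
    return "|".join(sorted(parts))


def from_label(label):
    """Split a combined label back into (pos, feats) (hypothetical helper)."""
    pos = ""
    feats = []
    for part in label.split("|"):
        if part.startswith("POS="):
            pos = part[len("POS="):]
        else:
            feats.append(part)
    return pos, "|".join(feats)


label = to_label("NOUN", "Case=Nom|Gender=Masc|Number=Sing")
print(label)
print(from_label(label))
```

The point is only that a single predicted label carries both pieces of information, which is why one component can fill in the coarse POS and the morphological features at the same time.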
Obviously you can adjust your training data or the converter if you want to provide the UPOS column as tag to the tagger component. Or you can keep using tagger and use the new attribute_ruler to map the tags from the fine-grained tags as in v2: https://nightly.spacy.io/usage/v3#migrating-training-mappings-exceptions
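As a rough illustration of the first option (adjusting the training data), the sketch below copies the UPOS column into the XPOS column of a CoNLL-U file before conversion, so that whatever is read as the fine-grained tag is the coarse POS. It assumes a standard 10-column, tab-separated CoNLL-U file; the function name copy_upos_to_xpos and the sample sentence are made up for this example.

```python
def copy_upos_to_xpos(conllu_text):
    """Return CoNLL-U text with column 5 (XPOS) overwritten by column 4 (UPOS)."""
    out = []
    for line in conllu_text.split("\n"):
        # Comment lines and blank sentence separators pass through unchanged.
        if not line.strip() or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        # Only touch well-formed 10-column token lines.
        if len(cols) == 10:
            cols[4] = cols[3]
        out.append("\t".join(cols))
    return "\n".join(out)


# Hypothetical one-token sentence, for illustration only.
sample = (
    "# text = λόγος\n"
    "1\tλόγος\tλόγος\tNOUN\tn-s---mn-\tCase=Nom|Gender=Masc|Number=Sing\t0\troot\t_\t_\n"
)
converted = copy_upos_to_xpos(sample)
print(converted)
```

Note that this throws away the original fine-grained tags in the rewritten file, so it is best run on a copy of the corpus.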
Thanks for the information. I overlooked the porting section in the documentation. Now the lemmatizer is working using the morphologizer.
After reading this, looking at the various language classes provided by spaCy 3, and seeing the statements in the API docs that TAG_MAP has been replaced by the AttributeRuler, I am a little confused. Are some of the language classes that come with spaCy 3 still using the older method of specifying token attributes, while others use the newer method? I am trying to create a new language class by modifying one of the existing ones, and would like to know whether it is OK to start from one that includes tag_map.py.