In an attempt to bootstrap a model that we can improve in Prodigy, I've trained a Welsh model from scratch using the simple training format. It has a functional tagger, lemmatizer, vectors and tokenizer, but no parser (yet).
The tagger's accuracy with basic UD tags is already pretty high, but our intention was to use it as a base model to be further improved upon with additional annotation in Prodigy. We have licenses and are hoping to add further functionality such as NER using Prodigy down the line (happy to contribute our work back to spaCy too).
However, when I reload the trained model in spaCy, the tagger produces different results, and is pretty consistently incorrect with certain POS tags - it's almost as if the TAG_MAP has slipped.
Model's tagging results immediately after training (all correct):
roedd VERB
y DET
dynion NOUN
yn PART
hapus ADJ
Model's tagging results when immediately reloaded (mistakes in bold):
roedd VERB
y **PUNCT**
dynion NOUN
yn **ADJ**
hapus ADJ
I've confirmed that the lang tag_map and training tag_map are the same.
Having looked at the model's meta.json, there is an additional "_SP" tag in the TAG_MAP that is not in my training TAG_MAP (nor is it in the training data).
It does sound like this is a bug in how spaCy handles the TAG_MAP. Adriane is on holiday at the moment, so I'll need to get back to you on this once she's back next week. This type of problem should be impossible in v3, as we've finally decoupled the morphology and lemmatization from the tagger. I'm sure we'll be able to make a patch to v2.3 to fix the specific issue as well.
One way you could probe this is to check the tagger.vocab.morphology.tag_map and tagger.vocab.morphology.reverse_index attributes after the model is loaded in spaCy. This should give you a definitive answer on whether your guess about the tag map is correct. You could modify these variables in-place as a quick fix if you can't come up with a better solution.
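As a rough sketch of that check, the helper below compares the tag map used for training against the one found on the loaded model and reports any tags that appear on only one side. The function name `diff_tag_maps` and the toy tag maps are made up for illustration; only the `tagger.vocab.morphology.tag_map` and `tagger.vocab.morphology.reverse_index` attributes come from the suggestion above, and the commented usage assumes a spaCy v2.x model at a placeholder path.

```python
def diff_tag_maps(expected, loaded):
    """Compare two tag maps (dicts keyed by tag name) and report differences.

    `expected` is the tag map used for training, `loaded` is the one found
    on the reloaded model. Both are treated as plain dicts.
    """
    expected_tags = set(expected)
    loaded_tags = set(loaded)
    shared = expected_tags & loaded_tags
    return {
        # Tags present after loading but absent from the training tag map
        "only_in_loaded": sorted(loaded_tags - expected_tags),
        # Tags from the training tag map that went missing after loading
        "only_in_expected": sorted(expected_tags - loaded_tags),
        # Tags present on both sides whose attribute dicts differ
        "different_values": sorted(t for t in shared if expected[t] != loaded[t]),
    }


# Toy data standing in for the real tag maps (values are illustrative only).
training_tag_map = {
    "VERB": {"pos": "VERB"},
    "DET": {"pos": "DET"},
    "NOUN": {"pos": "NOUN"},
    "PART": {"pos": "PART"},
    "ADJ": {"pos": "ADJ"},
}
# Simulate the extra "_SP" entry seen in the model's meta.json.
loaded_tag_map = dict(training_tag_map, _SP={"pos": "SPACE"})

report = diff_tag_maps(training_tag_map, loaded_tag_map)
print(report)

# With a real spaCy v2.x model this might look like (path is a placeholder):
# import spacy
# nlp = spacy.load("/path/to/welsh_model")
# tagger = nlp.get_pipe("tagger")
# print(diff_tag_maps(training_tag_map, tagger.vocab.morphology.tag_map))
# print(tagger.vocab.morphology.reverse_index)
```

If the report shows only the extra "_SP" entry, the tag names themselves match and the index mapping in `reverse_index` is the next thing to look at.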
If the problem is what you suspect, the best thing to do is to open an issue on the spaCy tracker. You can just leave a brief note and link this thread. We should be able to have a patch out in a couple of weeks.