Problem creating a new language to serve as a base model for further improvement in Prodigy

Hi,

In an attempt to bootstrap a model that we can improve in Prodigy, I've trained a Welsh model from scratch using the simple training format. It has a functional tagger, lemmatizer, vectors and tokenizer, but no parser (yet).

The tagger's accuracy with basic UD tags is already pretty high, but our intention was to use it as a base model to be further improved upon with additional annotation in Prodigy. We have licenses and are hoping to add further functionality such as NER using Prodigy down the line (happy to contribute our work back to spaCy too).

However, when I reload the trained model in spaCy, the tagger produces different results, and is pretty consistently incorrect with certain POS tags - it's almost as if the TAG_MAP has slipped.

Model's tagging results immediately after training (all correct):

roedd VERB
y DET
dynion NOUN
yn PART
hapus ADJ

Model's tagging results when immediately reloaded (mistakes in bold):

roedd VERB
y **PUNCT**
dynion NOUN
yn **ADJ**
hapus ADJ
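A comparison like the one above can be automated so that every changed tag is listed. The following is only a minimal sketch: `tag_mismatches` is a hypothetical helper, and the two lists are copied by hand from the output shown in this post rather than produced by the model.

```python
def tag_mismatches(before, after):
    """Return (token, tag_before, tag_after) for every token whose tag changed."""
    if len(before) != len(after):
        raise ValueError("token sequences differ in length")
    return [
        (token, tag_before, tag_after)
        for (token, tag_before), (_, tag_after) in zip(before, after)
        if tag_before != tag_after
    ]


# Tags copied from the post: straight after training vs. after reloading.
before_reload = [("roedd", "VERB"), ("y", "DET"), ("dynion", "NOUN"),
                 ("yn", "PART"), ("hapus", "ADJ")]
after_reload = [("roedd", "VERB"), ("y", "PUNCT"), ("dynion", "NOUN"),
                ("yn", "ADJ"), ("hapus", "ADJ")]

mismatches = tag_mismatches(before_reload, after_reload)
print(mismatches)  # [('y', 'DET', 'PUNCT'), ('yn', 'PART', 'ADJ')]
```

With a real pipeline, each list could be built as `[(t.text, t.pos_) for t in doc]` from the same sentence before saving and after reloading.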

I've confirmed that the lang tag_map and training tag_map are the same.

Having looked at the model's meta.json, there is an additional "_SP" tag in the TAG_MAP that is not in my training TAG_MAP (nor is it in the training data).

Is this a bug in spaCy with the simple training format (perhaps related to https://github.com/explosion/spaCy/issues/5648)?

Is there a better approach to bootstrapping a model for a new language using Prodigy?

Thanks :slight_smile:

I've now managed to load my Welsh model into Prodigy after placing a prodigy.json file in my virtual environment.

Surprisingly, the model seems to tag as it should when loaded into Prodigy, even though this isn't the case when reloaded in spaCy.

Could this be due to Prodigy applying its own TAG_MAP?

:information_source: Using universal coarse-grained POS tags: ADJ, ADP, ADV, AUX, CONJ,
CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X,
SPACE

Is there any way to save the model out of pos.correct so that I can try loading it in spaCy?

I'm using spaCy 2.3.2 to train the model and Prodigy version 1.10 on Ubuntu 18.

Hi @Gruff,

It does sound like this is a bug in spaCy related to the TAG_MAP handling. Adriane is on holiday at the moment, so I'll need to get back to you on this once she's back next week. This type of problem should be impossible in v3, as we've finally decoupled the morphology and lemmatization from the tagger. I'm sure we'll be able to make a patch to v2.3 to fix the specific issue as well.

One way you could probe this is to check the tagger.vocab.morphology.tag_map and tagger.vocab.morphology.reverse_index attributes after the model is loaded in spaCy. This should give you a definitive answer on whether your guess about the tag map is correct. You could modify these variables in-place as a quick fix if you can't come up with a better solution.
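As a rough sketch of that check: the helper below lists tag names that appear in the loaded tag map but not in the training one. `unexpected_tags` is a hypothetical name, and the two dictionaries are illustrative stand-ins, not the real Welsh tag map.

```python
def unexpected_tags(loaded_tag_map, training_tag_map):
    """Return tag names present after loading but absent from the training tag map."""
    return sorted(set(loaded_tag_map) - set(training_tag_map))


# Illustrative stand-ins (assumed values, not the real tag map).
training_tag_map = {
    "VERB": {"pos": "VERB"},
    "DET": {"pos": "DET"},
    "NOUN": {"pos": "NOUN"},
    "PART": {"pos": "PART"},
    "ADJ": {"pos": "ADJ"},
}
loaded_tag_map = dict(training_tag_map, _SP={"pos": "SPACE"})

extra = unexpected_tags(loaded_tag_map, training_tag_map)
print(extra)  # ['_SP']

# Against a real spaCy v2.3 model (the path and TRAINING_TAG_MAP are placeholders):
# import spacy
# nlp = spacy.load("/path/to/welsh_model")
# tagger = nlp.get_pipe("tagger")
# print(unexpected_tags(tagger.vocab.morphology.tag_map, TRAINING_TAG_MAP))
# print(tagger.vocab.morphology.reverse_index)
```

Printing `reverse_index` as well shows what type its keys have, which matters if you then remove entries by key.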

If the problem is what you suspect, the best thing to do is to open an issue on spaCy. You can just leave a brief note and link this thread. We should be able to have a patch out in a couple of weeks.

Ah, brilliant @honnibal,

Thanks for that! I was able to use:

tagger.vocab.morphology.tag_map.pop('_SP', None)
tagger.vocab.morphology.reverse_index.pop('6893682062797376370', None)

to get rid of the extra entries.

After saving the model and reloading it, the tagger tags as expected :smiley:
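For anyone repeating this workaround, here is a minimal sketch of the removal step. It assumes the keys of `reverse_index` may be stored either as integers or as strings (worth confirming by printing them first), so the hypothetical helper `pop_any_form` tries both. The dictionaries are illustrative stand-ins with made-up index values.

```python
def pop_any_form(mapping, key):
    """Delete `key` from `mapping` whether it is stored as given, as str, or as int.

    Returns True if at least one entry was removed.
    """
    candidates = {key, str(key)}
    if str(key).isdigit():
        candidates.add(int(key))
    removed = False
    for candidate in candidates:
        if candidate in mapping:
            del mapping[candidate]
            removed = True
    return removed


# Illustrative stand-ins (index values are made up for the example).
tag_map = {"NOUN": {"pos": "NOUN"}, "_SP": {"pos": "SPACE"}}
reverse_index = {6893682062797376370: 5, 42: 0}

removed_tag = pop_any_form(tag_map, "_SP")
removed_hash = pop_any_form(reverse_index, "6893682062797376370")
print(removed_tag, removed_hash)  # True True

# With a real spaCy v2.3 pipeline (paths are placeholders):
# import spacy
# nlp = spacy.load("/path/to/welsh_model")
# morphology = nlp.get_pipe("tagger").vocab.morphology
# pop_any_form(morphology.tag_map, "_SP")
# pop_any_form(morphology.reverse_index, "6893682062797376370")
# nlp.to_disk("/path/to/welsh_model_fixed")
# nlp = spacy.load("/path/to/welsh_model_fixed")
```

Checking membership before deleting means the helper reports whether anything was actually removed, which a plain `pop(key, None)` would hide.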

Thanks again for your help. I'll create an issue on the spaCy project and flag it up for Adriane.

I saw the work on the lemmatization and morphology - she deserves the holiday! :slight_smile:
