I've enjoyed extending Prodigy at my medium-sized startup, and most things have been pretty smooth.
Unfortunately, I've run into some snags with extending the POS tags. I'm making a truecaser that will have 3 tags: lower, upper and capital. I've read this thread, but I'm not having the luck they did:
The spaCy documentation says that the keys should be tags from our tag set, which they are:
The keys of the tag map should be strings in your tag set. The values should be a dictionary. The dictionary must have an entry POS whose value is one of the Universal Dependencies tags.
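A minimal tag map for these three tags might look like this (mapping everything to the UD tag "X" here is just illustrative; in the v2-style tag map format the POS entry is the lowercase "pos" key):

```python
# Tag map for a three-tag truecaser. Keys are the custom tags; each
# value maps to a Universal Dependencies POS. Casing labels aren't
# real parts of speech, so "X" (other) is used here -- that choice
# is an assumption for illustration.
TAG_MAP = {
    "lower": {"pos": "X"},
    "upper": {"pos": "X"},
    "capital": {"pos": "X"},
}
```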
It didn't work. I inspected the code as much as possible (calling help() and interacting with it). By the way, it would be very helpful if you could document the inputs and outputs of the Cythonized methods. I noticed that even with en_vectors_web_lg, the default model has 57 output tags. I get this error even though there was no pre-trained model:
ValueError: [T003] Resizing pre-trained Tagger models is not currently supported.
I realize I can use NER for this case, but that's a more complicated model. Ideally, we would like to train a tagger with arbitrary tags. Please let me know if this is currently possible or will be soon...
Thanks for the suggestion. Sorry, I didn't give you the concrete error from pos.batch-train earlier:
File "cython_src/prodigy/models/pos.pyx", line 90, in prodigy.models.pos.Tagger.batch_train
File "cython_src/prodigy/models/pos.pyx", line 136, in prodigy.models.pos.Tagger.update
File "cython_src/prodigy/models/pos.pyx", line 144, in prodigy.models.pos.Tagger.inc_gradient
File "cython_src/prodigy/models/pos.pyx", line 79, in prodigy.models.pos.Tagger.get_label_index
ValueError: tuple.index(x): x not in tuple
Your suggestion didn't work when I tried editing the source recipe file, i.e. adding those two lines to prodigy.recipes.pos.py under site-packages and running batch-train again.
I also tried setting English.Defaults.tag_map to the input tag map.
I also tried using numeric values in the tag map.
I'm almost done with my framework for custom models. I would have liked to plug into the POS tagger for this problem. I realize you have more important and exciting things to work on for your next release, though.
It sounds like your base model might still have the default or a blank label set. Since you're using a fully custom label scheme and tag map and are starting from a model without a tagger, you probably want to define your label set explicitly up front by calling tagger.add_label().
Before you start training, you can check that the tagger's labels, the keys of your tag map and the tags used in your annotations all line up in your base model.
If there’s a mismatch here, spaCy will fail to reconcile the annotation for the update.
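As a rough sketch of that consistency check (the helper name and structure here are made up, not a Prodigy or spaCy API):

```python
# Hypothetical helper: sanity-check that the tagger's labels, the
# tag map keys and the tags used in the annotations all agree
# before training starts.
def check_tag_consistency(tagger_labels, tag_map, annotation_tags):
    labels, mapped, used = set(tagger_labels), set(tag_map), set(annotation_tags)
    problems = []
    if not used <= labels:
        problems.append("tags missing from tagger labels: %s" % sorted(used - labels))
    if not used <= mapped:
        problems.append("tags missing from tag map: %s" % sorted(used - mapped))
    return problems

# "title" is a deliberate mismatch, so this reports two problems.
problems = check_tag_consistency(
    ["lower", "upper", "capital"],
    {"lower": {"pos": "X"}, "upper": {"pos": "X"}, "capital": {"pos": "X"}},
    ["lower", "capital", "title"],
)
```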
Also, just to confirm: do you already have annotations and just want to train the tagger? And are your annotations gold-standard? In that case, you don't necessarily have to train with Prodigy (which is really just a layer on top of spaCy's nlp.update); you can also export the data and call into spaCy directly. For a very custom use case like this, that might even be better, because you can write your own training loop, tweak more hyperparameters and really optimise for your final model.