Hi! We don't currently have an NER implementation that uses transformer weights in spaCy v2.x, so your approach wouldn't work – but once spaCy v3 is out, we'll have an updated version of Prodigy that will let you use transformer-based pipelines, pipelines with custom models in PyTorch/TF and pretty much everything else that spaCy v3 offers.
(The error you came across here btw looks like a different problem: internally, the NER annotation model deepcopies/pickles the `nlp` object, and it looks like pickle fails on the type hints. It's possible that this is a Python 3.6 issue, but I'm not 100% sure.)
Yes, that's correct – your recipe should return a callback that receives the answers and updates the model in the loop. In spaCy's case, you'd do that by calling `nlp.update`. The same approach could also work for any other model or library – you just need to update your model with the annotated examples and provide that logic as a callback function.
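To make this more concrete, here's a minimal sketch of what such a recipe could look like with a regular (non-transformer) spaCy v2 pipeline. The recipe name, the `en_core_web_sm` model, the JSONL source and the manual interface are just assumptions for the example – the important part is the `update` callback that receives batches of answers:

```python
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
import spacy


@prodigy.recipe(
    "ner.custom-teach",  # hypothetical recipe name for this example
    dataset=("Dataset to save answers to", "positional", None, str),
    source=("Path to a JSONL file with raw examples", "positional", None, str),
)
def custom_teach(dataset, source):
    nlp = spacy.load("en_core_web_sm")   # any trainable pipeline would work here
    optimizer = nlp.resume_training()    # keep updating the existing weights
    stream = JSONL(source)               # expects {"text": "..."} per line
    stream = add_tokens(nlp, stream)     # tokenize tasks for the manual interface

    def update(answers):
        # Called with a batch of answered tasks. Here we only use the accepted
        # examples and convert their spans to spaCy v2-style entity annotations.
        texts, annotations = [], []
        for eg in answers:
            if eg["answer"] != "accept":
                continue
            entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
            texts.append(eg["text"])
            annotations.append({"entities": entities})
        if texts:
            losses = {}
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)

    return {
        "view_id": "ner_manual",
        "dataset": dataset,
        "stream": stream,
        "update": update,
        "config": {"labels": ["PERSON", "ORG", "GPE"]},  # placeholder label set
    }
```

You'd then run it like any other custom recipe, e.g. with `prodigy ner.custom-teach your_dataset ./examples.jsonl -F recipe.py`.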
One thing to keep in mind when working with transformers is that they're still quite large and slow, especially compared to more lightweight CNN pipelines like spaCy's `en_core_web_sm`. They also typically require larger batch sizes. On top of that, a workflow like `ner.teach` only gives you very sparse data (binary feedback on single suggested spans, with everything else treated as missing values). So in my earlier experiments, I found it quite tricky to make the continuous updating work smoothly with transformers, because updates from very small batches of annotations took a long time and weren't as effective.

You might find that it's more efficient to start by labelling a small set of examples manually, training a transformer-based pipeline on them (which will hopefully give you good results even with only a very small set), and then using that pipeline to help you label data semi-automatically with a workflow like `ner.correct`.
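To illustrate what "sparse" means here, this is roughly how a binary `ner.teach` answer compares to a fully labelled example (the text and spans are made up for illustration, and extra task keys like hashes and meta are left out):

```python
# Hypothetical ner.teach answer: one suggested span with a binary accept/reject
# decision. Nothing is said about the rest of the text, so from the model's
# perspective most of the annotation is missing.
teach_answer = {
    "text": "Apple is opening a new office in London.",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
    "answer": "accept",  # could also be "reject" or "ignore"
}

# Hypothetical fully labelled example (e.g. created with ner.manual or
# ner.correct): all entities in the text are marked, which gives a
# transformer much denser and more effective updates per example.
manual_answer = {
    "text": "Apple is opening a new office in London.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 33, "end": 39, "label": "GPE"},
    ],
    "answer": "accept",
}
```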