Custom recipe to teach a DistilBERT model with custom labels

Hi,

I would like to create a custom recipe to "teach" a model based on multilingual DistilBERT (as provided by HuggingFace) with custom labels, but I am not sure what the necessary steps are. I see there is a ner.teach recipe which returns an update callback that points to the update() function of a spaCy model (is this correct?). So my understanding is that I have to create one that points to the update() function of the model I need. However, I don't think such a model is currently available from spaCy, although there is en_trf_distilbertbaseuncased_lg.

Thanks for any help,
Riccardo

In addition, I tried to do the following:

import spacy
from prodigy.models.ner import EntityRecognizer
nlp = spacy.load("en_trf_distilbertbaseuncased_lg")
model = EntityRecognizer(nlp, label=["TEST1", "TEST2"])

But I got an error with a very long stack trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cython_src/prodigy/models/ner.pyx", line 175, in prodigy.models.ner.EntityRecognizer.__init__
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 215, in _deepcopy_list
    append(deepcopy(a, memo))
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/thinc/neural/_classes/model.py", line 98, in __getstate__
    return srsly.pickle_dumps(self.__dict__)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/_pickle_api.py", line 14, in pickle_dumps
    return cloudpickle.dumps(data, protocol=protocol)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/cloudpickle/cloudpickle.py", line 1125, in dumps
    cp.dump(obj)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/cloudpickle/cloudpickle.py", line 482, in dump
    return Pickler.dump(self, obj)
  File "/usr/lib/python3.6/pickle.py", line 409, in dump
    self.save(obj)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/cloudpickle/cloudpickle.py", line 556, in save_function
    return self.save_function_tuple(obj)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/cloudpickle/cloudpickle.py", line 758, in save_function_tuple
    save(state)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 490, in save
    self.save_global(obj)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/cloudpickle/cloudpickle.py", line 877, in save_global
    self.save_dynamic_class(obj)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/cloudpickle/cloudpickle.py", line 686, in save_dynamic_class
    save(clsdict)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
    save(v)
  File "/usr/lib/python3.6/pickle.py", line 476, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python3.6/pickle.py", line 751, in save_tuple
    save(element)
  File "/usr/lib/python3.6/pickle.py", line 507, in save
    self.save_global(obj, rv)
  File "/opt/prodigy/venv/lib/python3.6/site-packages/srsly/cloudpickle/cloudpickle.py", line 875, in save_global
    Pickler.save_global(self, obj, name=name)
  File "/usr/lib/python3.6/pickle.py", line 927, in save_global
    (obj, module_name, name))
_pickle.PicklingError: Can't pickle typing.Union[_ForwardRef('numpy.ndarray'), _ForwardRef('cupy.ndarray')]: it's not the same object as typing.Union

Hi! We don't currently have an NER implementation that uses transformer weights in spaCy v2.x, so your approach wouldn't work – but once spaCy v3 is out, we'll have an updated version of Prodigy that will let you use transformer-based pipelines, pipelines with custom models in PyTorch/TF and pretty much everything else that spaCy v3 offers.

(The error you came across here, btw, looks like a different problem: internally, the NER annotation model deepcopies/pickles the nlp object, and it looks like pickle fails on the type hints. It's possible that this is a Python 3.6 issue, but I'm not 100% sure.)

Yes, that's correct – your recipe should return a callback that receives the answers, and that updates the model in the loop. In spaCy's case, that would be by calling nlp.update. The same approach could also work for any other model or library – you just need to update your model with examples, and provide this as a callback function.
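To make the callback idea concrete, here's a minimal standalone sketch of the pattern (hypothetical names, no Prodigy imports, so it runs on its own). In a real recipe, you'd return this function under the "update" key of the dict your recipe returns, and Prodigy would call it with batches of answers:

```python
# Sketch of the "update callback" pattern a custom recipe would use.
# make_update_callback and update_model are illustrative names, not Prodigy API.

def make_update_callback(update_model):
    """Wrap any model's update function (e.g. spaCy's nlp.update)."""
    def update(answers):
        # Keep only the examples the annotator accepted
        examples = [eg for eg in answers if eg.get("answer") == "accept"]
        if examples:
            update_model(examples)  # delegate to the underlying model
        return len(examples)
    return update
```

In a spaCy-based recipe, update_model would convert the accepted answers into training examples and call nlp.update; for a HuggingFace model, it would run one training step on the batch instead.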

One thing to keep in mind when working with transformers is that they're still quite large and slow, especially compared to more lightweight CNN pipelines like spaCy's en_core_web_sm. They also typically require larger batch sizes. On top of that, a workflow like ner.teach only gives you very sparse data (binary feedback on single spans, with otherwise missing values).

So in my earlier experiments, I found it quite tricky to make the continuous updating work smoothly with transformers: updates from very small batches of annotations took a long time and weren't as effective. You might find it more efficient to start by labelling a small set of examples manually, train a transformer-based pipeline on them (which will hopefully get you good results even with only a very small set), and then use that pipeline to help you label data semi-automatically with a workflow like ner.correct.
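That manual-first workflow could look roughly like this (ner.manual, train and ner.correct are real Prodigy recipes, but the dataset and model names here are placeholders, and the exact arguments depend on your Prodigy version):

```shell
# 1) Label a small seed set manually with your custom labels
prodigy ner.manual my_seed_set blank:en ./data.jsonl --label TEST1,TEST2

# 2) Train a pipeline on the seed annotations
prodigy train ./my_model --ner my_seed_set

# 3) Use the trained pipeline to label more data semi-automatically
prodigy ner.correct my_full_set ./my_model ./data.jsonl --label TEST1,TEST2
```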