Training new entity type with en_pytt_bertbaseuncased_lg model

Dear @ines and @honnibal,
I read your paper "spaCy meets PyTorch-Transformers: Fine-tune BERT, XLNet and GPT-2". Is there a workflow for using spacy-pytorch-transformers to train a new entity type in spaCy?
Specifically, can I use en_pytt_bertbaseuncased_lg in place of en_core_web_lg in spaCy's train_new_entity_type.py example, and would that model improve accuracy compared with en_core_web_lg? If there are any examples or threads about this, could you please point me to them?
Thanks,

Hello,
I would really appreciate any comment on this.
Thx,

Hi,

Sorry for the delay getting back to you on this. We do plan to make transformer-based NER models available, but we've been focussed on building out the core feature set of the spacy-pytorch-transformers library first, and making sure it's more stable. There are still some bugs we want to fix for the next release, especially around the serialisation.

We've been working on the text-categorization experiments, and so far we've had trouble getting them to work immediately in Prodigy. The main problem is that transformers need a reasonably high batch size (around 32 is normal). This is quite a challenge for active learning, which is why we designed the architectures for Prodigy to work the way they do.
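
To illustrate the batch-size tension described above, here is a minimal sketch in plain Python. It is not how Prodigy or spacy-pytorch-transformers is implemented; `UpdateBuffer` is a hypothetical helper, and the batch size of 32 is simply the figure mentioned above. It shows that if updates only happen once a full batch is available, the model stays unchanged for most of the annotator's answers.

```python
class UpdateBuffer:
    """Hypothetical helper: collect annotations until a full batch exists."""

    def __init__(self, batch_size=32):
        self.batch_size = batch_size
        self.pending = []

    def add(self, example):
        """Store one annotated example; return a full batch when ready, else None."""
        self.pending.append(example)
        if len(self.pending) < self.batch_size:
            return None
        batch, self.pending = self.pending, []
        return batch


buffer = UpdateBuffer(batch_size=32)
updates = 0
for i in range(100):
    # Each iteration stands in for one answer arriving from the annotator.
    batch = buffer.add({"text": f"example {i}"})
    if batch is not None:
        updates += 1  # a real loop would update the model here

print(updates, len(buffer.pending))  # 3 4
```

With 100 answers, the model would be updated only three times, and the last four answers would still be waiting in the buffer.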

Thank you for your reply. I'll wait for that release, then.

Hello Matthew,
I tried the en_pytt_bertbaseuncased_lg model to see what it would give me, and I got the following error:

prodigy ner.batch-train sport_terms en_pytt_bertbaseuncased_lg --output model1 --label "SPORT,GPE,ORG" --eval-split 0.2 --n-iter 100 --batch-size 50
Using 3 labels: SPORT, GPE, ORG

Loaded model en_pytt_bertbaseuncased_lg
Using 20% of accept/reject examples (880) for evaluation
Using 100% of remaining examples (3599) for training
Dropout: 0.2  Batch size: 50  Iterations: 100  

BEFORE      0.000             
Correct     0   
Incorrect   170
Entities    0                 
Unknown     0                 

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
  0%|                                                                      | 0/3599 [00:00<?, ?it/s]['U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['B-GPE', 'I-GPE', 'L-GPE', 'O', 'O']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['B-SPORT', 'I-SPORT', 'L-SPORT']
['U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-GPE', 'L-GPE', 'O', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O', 'O', 'B-GPE', 'L-GPE', 'O']
['U-ORG']
['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'U-GPE', 'O', 'U-GPE', 'O', 'O']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-SPORT']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']

['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'O', 'O', 'O', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'O', 'O', 'O', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['B-SPORT', 'L-SPORT']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
['U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-ORG', 'U-GPE', 'U-GPE']
Traceback (most recent call last):                                                                  
  File "/Users/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 621, in batch_train
    examples, batch_size=batch_size, drop=dropout, beam_width=beam_width
  File "cython_src/prodigy/models/ner.pyx", line 362, in prodigy.models.ner.EntityRecognizer.batch_train
  File "cython_src/prodigy/models/ner.pyx", line 453, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 446, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 447, in prodigy.models.ner.EntityRecognizer._update
  File "/Users/lib/python3.7/site-packages/spacy_pytorch_transformers/language.py", line 83, in update
    tok2vec = self.get_pipe("pytt_tok2vec")
  File "/Users/lib/python3.7/site-packages/spacy/language.py", line 250, in get_pipe
    raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names))
KeyError: "[E001] No component 'pytt_tok2vec' found in pipeline. Available names: ['sentencizer', 'ner']"

Regarding your comment here, is that error expected? Could you please give me some insight into whether the en_pytt_bertbaseuncased_lg model can be used for NER?
Many thanks,

Yes, if you just use the PyTorch-Transformer models and try to train a regular spaCy entity recognizer, this is not going to work. We currently do not have a NER model implementation for spacy-pytorch-transformers. We're still working on that. It's possible, but non-trivial, because we need to write the model implementation. The text classification works, because we've already written the implementations – see here:

If you want to train your own models with the transformers, check out the examples in the repo.
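
As a rough illustration of the KeyError above: the update step looks up a component named 'pytt_tok2vec', and the pipeline being trained only contains 'sentencizer' and 'ner'. The sketch below is plain Python with a hypothetical helper, `missing_components`, that performs the same kind of check up front. The component names are copied from the error message; nothing here is part of spaCy or Prodigy.

```python
def missing_components(pipe_names, required):
    """Return the required component names that are absent from the pipeline."""
    present = set(pipe_names)
    return [name for name in required if name not in present]


# Names copied from the error message in this thread.
available = ["sentencizer", "ner"]
missing = missing_components(available, ["pytt_tok2vec"])
if missing:
    print("Cannot update: pipeline lacks " + ", ".join(missing))
```

With a loaded model, the list of available names would come from `nlp.pipe_names`, which is what the error message reports as "Available names".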

What Matt meant by "trouble getting it to work in Prodigy" is that training a text classifier with a transformer model in the loop isn't that useful yet, because those models are large and require large batch sizes. spaCy's default models have been specifically designed to allow Prodigy-style workflows – the transformer models haven't, so we still need to work on the implementations to make them useful.

Edit: See this comment for more details: https://github.com/explosion/spacy-pytorch-transformers/issues/23#issuecomment-526732208