Transform annotations to match tokenization required for SpanBERT/BERT

Hi team,

We are using Prodigy a lot at my place of work and we are trying to evaluate how much better transformer models can be compared to the ones we already have (built with the default Prodigy/spaCy pipelines). We already have big annotated datasets for training spancat models, but these were annotated using the default spaCy tokenizer.

My question is whether there is an automated/native way of transforming our annotations (sentences that were tokenized using the default Prodigy tokenizer) to BERT's tokenization (SpanBERT in this case), so they can later be used to train a transformer model using the new train config capabilities.

In other words, I have a huge annotated dataset that uses the default spaCy tokenizer. I would like to somehow transform it to BERT tokenization without losing the annotated spans. Do I have to re-do the annotations from scratch to accommodate that, or is there a better/automated way?

Hope my question makes sense :smiley:

Thanks in advance


Have you seen our section on Efficient annotation for transformers like BERT?

To quote what is listed there:

spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.
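
To make that concrete, here's a small sketch (assuming you have a transformer-based pipeline like en_core_web_trf installed, and attribute names as of spacy-transformers v1): spaCy keeps its own linguistic tokens as the unit of annotation and stores the aligned subword pieces next to them, so span annotations made on spaCy tokens don't need to be redone.

# Sketch, assuming en_core_web_trf is installed.
# spaCy's linguistic tokens stay the unit of annotation; spacy-transformers
# stores the subword pieces and their alignment on the doc.
import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("people call me Noa")

print([token.text for token in doc])        # linguistic tokens
print(doc._.trf_data.wordpieces.strings)    # subword tokens the transformer saw
print(doc._.trf_data.align.lengths)         # how many pieces each token maps to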

Or are you interested in training non-spaCy models?

Hi Vincent and thank you for your quick response.

I had looked at the section you mentioned; that's why I thought there might be a simpler way :smiley:.
Thanks for confirming it. So if I understood correctly, the tokenization will be aligned automatically during training, given that I am using spaCy models, right?

PS: I am not interested in non-spaCy models, at least for the moment.

Thanks again for reaching out.

Regards.

Gotcha. Then I think the following steps are what you need.

Step 1: Annotation

Label some data. In my case I've made this example dataset:

{"text": "hi my name is Vincent"}
{"text": "people call me Noa"}

And I've annotated these two examples in Prodigy via:

python -m prodigy ner.manual issue-5923 blank:en examples.jsonl --label name

This saves the NER annotations in a dataset named issue-5923.
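
If you want to double-check what got stored, you can read the dataset back out in Python. A quick sketch using Prodigy's database API (issue-5923 is just my example dataset name):

# Quick sketch: read the annotated examples back out of Prodigy's database.
from prodigy.components.db import connect

db = connect()                            # uses your prodigy.json settings
examples = db.get_dataset("issue-5923")   # list of annotation task dicts
print(len(examples))
print(examples[0]["text"], examples[0].get("spans"))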

Step 2: Convert

Next, let's prepare the data so that spaCy can use it.

python -m prodigy data-to-spacy ./corpus --ner issue-5923 --eval-split 0.5

This creates a folder that contains a train.spacy and a dev.spacy file. These are binary file formats that spaCy expects (they also keep things consistent and lightweight).
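
If you're curious what ended up in those binary files, you can peek inside them with spaCy's DocBin. A small sketch (the paths assume the ./corpus folder from the command above):

# Small sketch: inspect the exported training data.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./corpus/train.spacy")
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])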

Step 3: Config

Before running the train command we need to tell spaCy that we want to use a transformer model. A simple way to set that up is to go to the quickstart portion of the spaCy docs and generate a config with the transformer option selected.

This generated the following config for me:

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
vectors = null

[system]
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

I've saved this configuration into a file called partial.cfg. As the comments at the top of the file confirm, this is only a partial configuration: it only contains the parts that define the transformer and the NER output. To turn it into a complete configuration, run the following command:

python -m spacy init fill-config partial.cfg full-config.cfg

This will generate a full-config.cfg file that has the missing pieces filled in.
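
One more note, since you mentioned SpanBERT specifically: the quickstart config uses roberta-base, but the name setting under [components.transformer.model] accepts any Hugging Face checkpoint, so you could point it at a SpanBERT model instead. Here's a small sketch to sanity-check that a checkpoint name resolves before you start training (SpanBERT/spanbert-base-cased is an assumption on my part; pick whichever SpanBERT variant you actually need):

# Sketch: confirm the Hugging Face checkpoint you plan to put in the config's
# [components.transformer.model] `name` setting can be downloaded and loaded.
# "SpanBERT/spanbert-base-cased" is an assumed example checkpoint name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SpanBERT/spanbert-base-cased")
print(tokenizer.tokenize("hi my name is Vincent"))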

Step 4: Train

Finally, you'll need to run the train command from spaCy to train the transformer model.

python -m spacy train full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Note that I'm manually specifying the --paths.train and --paths.dev parameters here; they correspond to the [paths] settings in the full-config.cfg file. Here's what the output looked like on my machine:

ℹ No output directory provided
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-09-13 10:39:51,656] [INFO] Set up nlp object from config
[2022-09-13 10:39:51,664] [INFO] Pipeline: ['transformer', 'ner']
[2022-09-13 10:39:51,667] [INFO] Created vocabulary
[2022-09-13 10:39:51,668] [INFO] Finished initializing nlp object
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
[2022-09-13 10:40:01,957] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0           7.57      3.52    0.00    0.00    0.00    0.00
200     200         164.92    130.18  100.00  100.00  100.00    1.00
...

You'll notice that I'm getting warnings about the lack of a GPU. That's the main downside of transformer models: without a heavy compute resource they take a really long time to train (which is why I personally try to avoid them when possible).
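
Once training finishes, you can load the saved pipeline and check that the predicted entities look sensible. A small sketch, assuming you re-run the train command with --output ./training so the best checkpoint lands in ./training/model-best (the run above didn't pass an output directory, hence the "No output directory provided" message):

# Sketch: load the best checkpoint written by `spacy train --output ./training`.
# The ./training path is an assumption; the run shown above used no --output.
import spacy

nlp = spacy.load("./training/model-best")
doc = nlp("hi my name is Vincent")
print([(ent.text, ent.label_) for ent in doc.ents])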

Final Notes

You might notice that there are quite a few commands here. To keep everything consistent and reproducible, I recommend using the spaCy projects feature. You can learn more about it in our docs or in this YouTube video.


Awesome!

Thanks again for the thorough response, Vincent!
