Transform annotations to match tokenization required for SpanBERT/BERT

Hi team,

We use Prodigy a lot at my workplace, and we are trying to evaluate how much better transformer models can be compared to the ones we already have (built with the default Prodigy/spaCy pipelines). We already have big annotated datasets for training spancat models, but these were annotated using the default spaCy tokenizer.

My question is whether there is an automated/native way of transforming our annotations (sentences that were tokenized using the default Prodigy tokenizer) to BERT's tokenization (SpanBERT in this case) so that they can later be used to train a transformer model with the new train config capabilities.

In other words, I have a huge annotated dataset that uses the default spaCy tokenizer. I would like to somehow transform it to BERT tokenization without losing the annotated spans. Do I have to re-do the annotations from scratch to accommodate that, or is there a better/automated way?

Hope my question makes sense :smiley:

Thanks in advance


Have you seen our section on Efficient annotation for transformers like BERT?

To quote what is listed there:

spaCy v3 lets you train a transformer-based pipeline and will take care of all tokenization alignment under the hood, to ensure that the subword tokens match to the linguistic tokenization. You can use data-to-spacy to export your annotations and train with spaCy v3 and a transformer-based config directly, or run train and provide the config via the --config argument.
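
To make that concrete, the two routes would look roughly like this, assuming a recent Prodigy version (the dataset name my_spans_db and the config filename are placeholders):

python -m prodigy data-to-spacy ./corpus --spancat my_spans_db --eval-split 0.2
python -m spacy train ./transformer-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Or, staying within Prodigy and passing the transformer config directly:

python -m prodigy train ./output --spancat my_spans_db --config ./transformer-config.cfg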

Or are you interested in training non-spaCy models?

Hi Vincent and thank you for your quick response.

I had looked at the section you mentioned, which is why I thought there might be a simpler way :smiley:.
Thanks for confirming it. So if I understood correctly, the tokenization will be handled automatically during training, given that I am using spaCy models, right?

PS: I am not interested in non-spaCy models, at least for the moment.

Thanks again for reaching out.

Regards.

Gotcha. Then I think the following steps are what you need.

Step 1: Annotation

Label some data. In my case I've made this example dataset:

{"text": "hi my name is Vincent"}
{"text": "people call me Noa"}

And I've annotated these two examples in Prodigy via:

python -m prodigy ner.manual issue-5923 blank:en examples.jsonl --label name

This saves the NER annotations in a dataset named issue-5923.
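
If you want to double-check what got stored, you can always export the dataset back out to JSONL (this just inspects the annotations, it isn't required for training):

python -m prodigy db-out issue-5923 > ./annotations.jsonl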

Step 2: Convert

Next, let's prepare the data so that spaCy can use it.

python -m prodigy data-to-spacy ./corpus --ner issue-5923 --eval-split 0.5

This creates a folder that contains a train.spacy and a dev.spacy file. These are binary file formats that spaCy expects (they also keep things consistent and lightweight).
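
If you're curious about what these files contain, here's a small Python sketch (purely optional) that loads the exported training data back into spaCy Doc objects; the path assumes the command above:

import spacy
from spacy.tokens import DocBin

# Load the binary corpus produced by data-to-spacy
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./corpus/train.spacy")

# Each Doc carries the annotated entities from the Prodigy dataset
for doc in doc_bin.get_docs(nlp.vocab):
    print(doc.text, [(ent.text, ent.label_) for ent in doc.ents])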

Step 3: Config

Before running the train command we need to tell spaCy that we want to run a transformer model. A simple way to set that up is to go to the quick start portion of the spaCy docs and to generate a config with transformers.

This generated the following config for me:

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
vectors = null
[system]
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

I've saved this configuration into a file called partial.cfg. As the comments on top of this file confirm though, this is only a partial configuration file. It only contains the parts that define the transformer and the NER output. To make it a complete configuration you'll want to run the following command:

python -m spacy init fill-config partial.cfg full-config.cfg

This will generate a full-config.cfg file that has the missing pieces filled in.
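
Optionally, you can sanity-check the filled config together with the exported data before training:

python -m spacy debug data full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy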

Step 4: Train

Finally, you'll need to run the train command from spaCy to train the transformer model.

python -m spacy train full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Note that I'm manually specifying the --paths.train and --paths.dev parameters here. These overrides correspond to the [paths] settings in the full-config.cfg file. Here's what the output looked like on my machine:

ℹ No output directory provided
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2022-09-13 10:39:51,656] [INFO] Set up nlp object from config
[2022-09-13 10:39:51,664] [INFO] Pipeline: ['transformer', 'ner']
[2022-09-13 10:39:51,667] [INFO] Created vocabulary
[2022-09-13 10:39:51,668] [INFO] Finished initializing nlp object
/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/torch/amp/autocast_mode.py:198: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
[2022-09-13 10:40:01,957] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0           7.57      3.52    0.00    0.00    0.00    0.00
200     200         164.92    130.18  100.00  100.00  100.00    1.00
...

You'll notice that I'm getting warnings about the lack of a GPU. That's the main downside of transformer models: without heavy compute resources they really take a long time to train (which is why I personally try to avoid them when possible).
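
If you do have a GPU available (with CUDA and cupy set up), you can pass it to the train command to speed things up considerably:

python -m spacy train full-config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --gpu-id 0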

Final Notes

You might notice that there are quite a few commands here. To keep everything consistent, it's recommended to use the spaCy projects feature. You can learn more about it on our docs or in this YouTube video.


Awesome!

Thanks again for the thorough response Vincent!


Hello,
I also have the same issue. I have created a spancat dataset using the default Prodigy settings and now I want to use this dataset with BERT for Patents. Following the instructions above, I am able to run the data-to-spacy command and was working on the config file. I could not get the widget shown in the screenshot above, so I went for the CLI approach. Attached is my CLI command; I had to use the 'ner' pipeline option because the CLI does not offer a spancat option, and there is also no components option like in the widget that includes spancat. Is my configuration correct or should I make any adjustments?
I am new to this, so sorry if my question sounds very basic. @koaning Can you please help?

I am aware of a Firefox bug related to our online widget. But I just went to the online quickstart and selected settings that include the spancat component.

This will include a spancat in the config, which I'm displaying below.

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
vectors = null
[system]
gpu_allocator = "pytorch"

[nlp]
lang = "en"
pipeline = ["transformer","spancat"]
batch_size = 128

[components]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.spancat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256

[initialize]
vectors = ${paths.vectors}

Does this help? You should be able to use this config and follow the same steps that I've outlined above.
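
One small difference with my earlier NER example: since your annotations are spans, you'd export them with the spancat flag instead (the dataset name below is a placeholder for yours):

python -m prodigy data-to-spacy ./corpus --spancat your-spancat-dataset --eval-split 0.2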

Hello,
Thank you for the quick response. I followed the instructions above and ran the train command. The command has been running for around 2 days now. How long will it run? Is there any way I can check that? Secondly, I am getting an accuracy of 66% right now, which means I am losing a lot of contents of my created dataset when transforming it for Transformers. Is this the best way to go about it, or should I do some preprocessing in Python to create a Transformers dataset from my Prodigy dataset? Please help.

The command has been running for around 2 days now.

That's a long time. You can choose to run with fewer steps, which may be fair since the score seems to have flattened out. That said, transformer models require a lot of compute, so we typically advise training on a GPU to speed things up.
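
For reference, the settings that control when training stops live in the [training] block of the full config; the values below are roughly the defaults spaCy fills in, not recommendations:

[training]
max_steps = 20000
max_epochs = 0
patience = 1600
eval_frequency = 200

Lowering max_steps, or relying on patience (which stops training once the score hasn't improved for that many steps), will cut the runtime.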

But it might also be a good idea to check whether you really need a transformer architecture; maybe en_core_web_lg is a more lightweight starting point that still reaches an acceptable accuracy.
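
If you want to try that route, you could for example train straight from Prodigy on top of a pretrained pipeline, something along these lines (assuming en_core_web_lg is installed and using a placeholder dataset name):

python -m prodigy train ./output-lg --spancat your-spancat-dataset --base-model en_core_web_lg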

I am losing a lot of contents of my created dataset when transforming it for Transformers

I'm not sure if I understand. What do you mean by "contents" here?

@koaning Thank you for the response. The command finished successfully shortly after I posted my question here :smile:, so everything is okay now.
Regarding the "losing the contents of the dataset" question: what I want to ask is, what does the 65% accuracy represent? My understanding was that to change my dataset used for spaCy into a dataset that can be used for training a transformer algorithm (BERT for Patents in my case), I had to run this train command. The train command would change my spaCy dataset into a transformer dataset. If that is correct, isn't 65% accuracy too low?

I'm not sure what you mean by "transformer dataset" here. You have a dataset that you've annotated yourself and you're using a model to try and predict the patterns in your data. But this model, in this case one with a transformer architecture, does not produce another "dataset". The 65% that you're seeing here is a metric that tells you how well the model is making predictions, but it doesn't refer to another dataset.

Could you elaborate on what you mean by "transformer dataset" here?

I will explain my problem again. I created a spancat dataset using Prodigy and then used the prodigy train command to train the model. I have my annotated dataset now and want to use this dataset for BERT for Patents.
The problem is that the default Prodigy dataset cannot be used directly for BERT for Patents. I want to find an optimal and easy way such that I do not lose my annotations and don't need to annotate the data again for BERT.
I came across this thread, implemented your solution above, and got an accuracy of 65%.
Is this model transforming my original Prodigy dataset into a dataset that can be used for transformer algorithms? Or is it training a transformer algorithm (RoBERTa-base or something else) on my dataset?
Can you suggest a way to go about this such that I can use my data for BERT for Patents?
@koaning I hope I am clear; if not, please reach out. Thank you very much for your responses.

The annotated data is stored in a database as tokenised text. You can explore these annotations yourself by running the db-out recipe. From here you're free to use the prodigy train ... command to train spaCy models, some of which use transformers, or you can re-use this data to train whatever model you like using whatever framework you prefer.

So the model isn't transforming the original dataset; rather, the prodigy train ... command has a step that ensures the data can be used to train spaCy models. This is the data-to-spacy step that I mentioned here. But you can re-use the same dataset for other models from other ecosystems. This may involve writing some custom Python code, but you're free to do whatever you want with the annotations stored in the database.

I'm not aware of the BERT for patents model. But you're free to export the dataset using prodigy db-out to get a JSONL file that can be used to feed other systems.

If you're following the steps that I mentioned above then the Prodigy dataset is turned into DocBin objects, which represent spaCy documents.

I'm not sure what the requirements are for "BERT for Patents". Is it a Huggingface model? The main thing, usually, is that you need to make sure that the tokens match. Huggingface transformers usually use a BytePair-style tokeniser that produces subtokens instead of the more word-level tokens that spaCy uses.
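
To give a rough idea of what such custom code could look like if it is a Huggingface model: Prodigy stores spans as character offsets, and fast Huggingface tokenizers can return character offsets per subtoken, so you can map one onto the other. A sketch (the exported file name, the bert-base-cased model and the containment rule are just illustrative choices, not a full training script):

import srsly  # srsly ships with spaCy/Prodigy
from transformers import AutoTokenizer

# Any fast Huggingface tokenizer works here; bert-base-cased is just a stand-in
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Annotations exported from Prodigy via `prodigy db-out`
for eg in srsly.read_jsonl("annotations.jsonl"):
    if eg.get("answer") != "accept":
        continue
    enc = tokenizer(eg["text"], return_offsets_mapping=True)
    labels = ["O"] * len(enc["input_ids"])
    for span in eg.get("spans", []):
        for i, (start, end) in enumerate(enc["offset_mapping"]):
            # Special tokens come back with (0, 0) offsets; skip them
            if start == end:
                continue
            # Label subtokens that fall inside the annotated character span
            if start >= span["start"] and end <= span["end"]:
                labels[i] = span["label"]
    print(enc.tokens(), labels)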

Yes, it is a Huggingface model. Attaching the link here for reference: ktgiahieu/bert-for-patents-finetuned-ner · Hugging Face
I was also trying to understand the full_config file that I used for the prodigy train command after the data-to-spacy command. According to my understanding, it uses the roberta-base transformer model to train the pipeline. Am I right? I am asking because I will include this part in my thesis research. Could you mention further details of the model and architecture so I can do further research on it before writing about it?

If you're using en_core_web_trf as a base model then I believe it indeed uses RoBERTa under the hood. All other relevant details of this model can be found on the model card in the spaCy docs.

Thank you @koaning for the responses. Just one last thing: since I used this model for NER, can I find the per-label (per-entity-type) accuracy of the model? How can I do that?
When I use the label-stats option with prodigy train, it takes a lot of time and sometimes my laptop crashes.
Thank you

When I use the label-stats option with prodigy train, it takes a lot of time and sometimes my laptop crashes.

Could you elaborate a bit on this? Does this happen at all times or only when you use the transformer model? Does your machine run out of memory here? How big is your dataset? Do you have many documents or very long documents as well?
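
In the meantime, a lighter-weight alternative for per-label scores: if you've exported your data with data-to-spacy and trained with spaCy directly (saving the pipeline via --output on the train command), you can run spaCy's evaluate command, which prints per-label precision/recall/F-score and can write them to a JSON file. Roughly:

python -m spacy evaluate ./output/model-best ./corpus/dev.spacy --output ./metrics.json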

I have tried twice now and both times my laptop stops working. This problem also occurs when I use other models. I was using spancat with prodigy train and label-stats and faced the same problem. I think my machine runs out of memory.
The dataset has around 14,289 labeled entities. The entities can be up to 20 to 30 words long.

That does feel like a sizeable dataset. How much memory does your machine have? How long are your documents? Are they sentences?

Entities of 20 to 30 words also feel very long. Could you elaborate on the task that you're trying to accomplish? I'm wondering if spancat is the way forward when you have spans that are this long. Are you detecting sentences?

I had the data in the spaCy format and ran the spacy evaluate command. Got the results I needed. Thank you very much for the help @koaning
