Is there a way to change Prodigy annotations to transformers-based annotations, without re-annotating?

mumud123 · March 1, 2021, 5:04pm

Hi, I built an NER pipeline with spaCy and annotated some data. Now I want to try training a transformers-based model, but I see that the tokenization is different. Is it possible to convert my previous annotations to the transformers tokenization, or do I need to redo my annotations with the new tokenization?

ines · March 3, 2021, 12:47am

Hi! How are you training your transformer-based model? If you're using spaCy v3, spaCy will take care of the tokenization alignment under the hood, and you'll be able to train from data with linguistically-motivated tokens, while still using the word piece tokens provided by the transformer (which are less about "words" and more about efficient embedding).

You definitely shouldn't have to re-annotate anything, though. Even if you do need to align your tokens manually, you can use existing libraries for this – for instance, explosion/tokenizations on GitHub, a robust and fast tokenization alignment library for Rust and Python that we also use in spaCy: https://tamuhey.github.io/tokenizations/
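To make "aligning tokens" concrete, here is a rough pure-Python sketch that maps each word-level token to the subword pieces overlapping it in character space. It is not the tokenizations library's algorithm or API. The helper names `char_offsets` and `align` and the subword split are made up for illustration, and it assumes every token appears verbatim in the text (no "##" prefixes, lowercasing or other normalization), which real wordpiece tokenizers often violate. That is exactly the kind of case a dedicated alignment library handles for you.

```python
def char_offsets(text, tokens):
    """Find each token in the text, left to right, as (start, end) character offsets."""
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if the token isn't found verbatim
        end = start + len(tok)
        offsets.append((start, end))
        pos = end
    return offsets


def align(text, tokens_a, tokens_b):
    """For each token in tokens_a, list the indices of tokens_b that overlap it."""
    offs_a = char_offsets(text, tokens_a)
    offs_b = char_offsets(text, tokens_b)
    return [
        [j for j, (b_start, b_end) in enumerate(offs_b) if b_start < a_end and b_end > a_start]
        for a_start, a_end in offs_a
    ]


text = "Prodigy annotations"
words = ["Prodigy", "annotations"]            # linguistically-motivated tokens
pieces = ["Pro", "digy", "annot", "ations"]   # hypothetical subword split
a2b = align(text, words, pieces)
print(a2b)  # [[0, 1], [2, 3]]
```

With a mapping like this, an entity annotated on word 1 ("annotations") can be projected onto pieces 2 and 3 without touching the original annotation.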

mumud123 · March 3, 2021, 7:00am

Thanks for the response, it's good to hear that I don't need to re-annotate! I am using spaCy v3.

Is Prodigy's tokenization only used to snap to token boundaries during annotation?

ines · March 3, 2021, 11:48pm

The tokenization it snaps to is the tokenization provided by the model/tokenizer used during annotation – so if you're using spaCy, this should match. There shouldn't be any significant changes in tokenization between v2 and v3. If you're running the debug data command with your exported data, how many misaligned spans does it point out? (See the Command Line Interface page in the spaCy API documentation.)

mumud123 · March 4, 2021, 4:53am

I have 27 misaligned tokens in the training data, which has ~14000 data samples. Is this because I annotated with spaCy v2/Prodigy but trained with spaCy v3?

The error rate seems to be fine, but can I conceivably fix it with scripts in the library you shared? (I'm not sure if this is worth doing but want to know)

ines · March 4, 2021, 6:19am

Ah, okay, 27 in 14k examples definitely isn't very much – that's also not something you need alignment logic for (I just mentioned that because I initially thought your goal was to align linguistically-motivated tokens to wordpiece tokens). It might be worth looking at those specific examples to see what the difference is. I don't think we made any significant changes to the tokenization rules between v2 and v3, but even a very small adjustment or bug fix could potentially change certain examples. So maybe there's one specific pattern all of these spans have in common.

To find the examples, you're basically looking for spans where Doc.char_span, called with the given span's start and end character offsets, returns None.
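A minimal sketch of that check, assuming the annotations are Prodigy-style dicts with a "text" and a list of "spans" carrying character offsets in "start" and "end". The function name `find_misaligned_spans` and the sample data are made up for illustration, and `spacy.blank("en")` stands in for whatever pipeline or tokenizer you actually train with.

```python
import spacy


def find_misaligned_spans(nlp, examples):
    """Yield (text, span) pairs whose character offsets don't fall on token boundaries."""
    for eg in examples:
        doc = nlp.make_doc(eg["text"])  # tokenize only, no other pipeline components
        for span in eg.get("spans", []):
            # char_span returns None if the offsets don't map to whole tokens
            if doc.char_span(span["start"], span["end"]) is None:
                yield eg["text"], span


nlp = spacy.blank("en")  # assumption: swap in the tokenizer used for training
examples = [
    {"text": "Apple opened a store in Berlin.",
     "spans": [{"start": 0, "end": 5, "label": "ORG"},
               {"start": 24, "end": 30, "label": "GPE"}]},
    {"text": "Apple opened a store in Berlin.",
     "spans": [{"start": 0, "end": 4, "label": "ORG"}]},  # "Appl" cuts a token in half
]

misaligned = list(find_misaligned_spans(nlp, examples))
for text, span in misaligned:
    print(repr(text[span["start"]:span["end"]]), span)
```

Printing the offending substring next to the span usually makes a shared pattern (a trailing punctuation mark, a hyphen, a unit suffix) easy to spot.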

mumud123 · March 4, 2021, 9:15pm

ok, that makes sense. thank you!
