Hi, I went through making an NER pipeline with spacy and annotated some data. Now I want to try training a transformers-based model, but I see that the tokenization is different -- is it possible to convert my previous annotations to transformers tokenization, or do I need to re-do my annotations with the new tokenization?
Hi! How are you training your transformer-based model? If you're using spaCy v3, spaCy will take care of the tokenization alignment under the hood, and you'll be able to train from data with linguistically-motivated tokens, while still using the word piece tokens provided by the transformer (which are less about "words" and more about efficient embedding).
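To make that concrete, here's a minimal sketch of what v3 training data looks like on your side: character offsets over the original text, no wordpieces anywhere. (The text, offsets and blank pipeline here are just placeholders; in a real run you'd load a transformer config, e.g. one generated with `spacy init config`.)

```python
import spacy
from spacy.training import Example

# Minimal sketch: annotations are character offsets over the original text.
# In a real transformer setup you'd load a pipeline from a config with a
# "transformer" component; spaCy aligns its tokens to the wordpieces internally.
nlp = spacy.blank("en")

text = "Apple is looking at buying U.K. startup for $1 billion"
annotations = {"entities": [(0, 5, "ORG"), (27, 31, "GPE")]}

doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)
print(example.reference.ents)  # (Apple, U.K.)
```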
You definitely shouldn't have to re-annotate anything, though. Even if you do need to align your tokens manually, you can use existing libraries for this, for instance the one we also use in spaCy: explosion/tokenizations (a robust and fast tokenization alignment library for Rust and Python): https://tamuhey.github.io/tokenizations/
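In case you ever do need it, the basic usage looks roughly like this (a sketch from memory of the library's Python API, so double-check the repo's README; the token lists are made up for illustration):

```python
import tokenizations  # pip install pytokenizations

# Sketch: align spaCy-style tokens to wordpiece-style tokens.
spacy_tokens = ["New", "York", "City"]
wordpieces = ["new", "york", "ci", "##ty"]

# get_alignments returns, for each token in one list, the indices of the
# overlapping tokens in the other list
a2b, b2a = tokenizations.get_alignments(spacy_tokens, wordpieces)
print(a2b)  # indices into `wordpieces` for each spaCy token
print(b2a)  # indices into `spacy_tokens` for each wordpiece
```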
Thanks for the response, it's good to hear that I don't need to re-annotate! I am using spacy v3.
Is prodigy tokenization only used to snap to token boundaries during annotation?
The tokenization it snaps to is the tokenization provided by the model/tokenizer used during annotation, so if you're using spaCy, this should match. There shouldn't be any significant changes in tokenization between v2 and v3. If you run the `debug data` command with your exported data, how many misaligned spans does it point out? (See the Command Line Interface page of the spaCy API documentation.)
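For reference, a typical invocation looks something like this (the config name and data paths are placeholders for whatever your project uses):

```bash
# Hypothetical paths: point these at your own config and exported .spacy files
python -m spacy debug data config.cfg \
  --paths.train ./train.spacy \
  --paths.dev ./dev.spacy
```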
I have 27 misaligned tokens in the training data, which has ~14000 data samples. Is this because I annotated with spacy v2/prodigy but trained with spacy v3?
The error rate seems to be fine, but can I conceivably fix it with scripts in the library you shared? (I'm not sure if this is worth doing but want to know)
Ah, okay, 27 in 14k examples definitely isn't very much. That's also not something you need alignment logic for (I just mentioned that because I initially thought your goal was to align linguistically-motivated tokens to wordpiece tokens). It might be worth looking at those specific examples to see what the difference is. I don't think we made any significant changes to the tokenization rules between v2 and v3, but even a very small adjustment or bug fix could potentially change certain examples. So maybe there's one specific pattern all of these spans have in common.
To find the examples, you're basically looking for spans where `Doc.char_span` with the given span's start and end index returns `None`.
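For example, a quick script along these lines should surface them (a rough sketch that assumes a Prodigy-style JSONL export where each record has "text" and "spans" with character offsets; adjust the path and field names for your own data):

```python
import json
import spacy

# Rough sketch: flag annotated spans that don't line up with spaCy's tokenization.
# Assumes a JSONL export with "text" and "spans" (start/end character offsets);
# the filename and keys are placeholders.
nlp = spacy.blank("en")  # or spacy.load(...) for the pipeline you train with

with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            # Doc.char_span returns None if the character offsets don't map
            # cleanly onto token boundaries
            if doc.char_span(span["start"], span["end"]) is None:
                print("Misaligned:", repr(eg["text"][span["start"]:span["end"]]))
                print("  in:", eg["text"])
```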
ok, that makes sense. thank you!