Tokenization compatibility issues in rel.manual

If you're happy to annotate the BPE tokens and the relations between them, and don't care so much about aligning the tokens to spaCy's linguistic tokenization, you could also just load in pre-tokenized text produced by your tokenizer. Here's an example using a word piece tokenizer for NER annotation that aligns with a transformer model: https://prodi.gy/docs/named-entity-recognition#transformers-tokenizers

You don't have to do it within the recipe – you could also use the logic as a preprocessing step. One of the key parts here is to set the "ws" key on the tokens, a boolean indicating whether the token is followed by whitespace. Prodigy will use this in the UI to render less whitespace and preserve readability. The relations UI will still draw borders around the tokens, so it might be a bit less pretty for subword tokens – but you'll have alignment.
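As a rough sketch of that preprocessing step: given subword strings with character offsets into the original text (e.g. from a word piece tokenizer's offset mapping), you can build Prodigy-style token dicts and set `"ws"` by checking whether the next character is whitespace. The helper name and the exact offsets here are illustrative, not part of any API:

```python
def make_prodigy_tokens(text, offsets):
    """Build Prodigy-style token dicts from (start, end) character offsets.

    offsets: list of (start, end) spans into `text`, one per subword token.
    """
    tokens = []
    for i, (start, end) in enumerate(offsets):
        tokens.append({
            "text": text[start:end],
            "start": start,
            "end": end,
            "id": i,
            # "ws" is True if the token is followed by whitespace, so the UI
            # can render consecutive subword pieces without a gap between them
            "ws": bool(text[end:end + 1].isspace()),
        })
    return tokens

# Hypothetical subword split of "unbelievable" into "un" + "believable"
text = "unbelievable results"
offsets = [(0, 2), (2, 12), (13, 20)]
task = {"text": text, "tokens": make_prodigy_tokens(text, offsets)}
```

You'd then write one such `task` dict per example to a JSONL file and load it as your input source.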

(Also, thanks for the kind words, glad to hear you like the new relations features :blush:)