Annotation with WordPiece tokens

I'm trying to use Prodigy to annotate data for an external (i.e. non-spaCy) model. I'm having difficulty generating files that are compatible with both Prodigy and my model.

I'm using a Hugging Face tokenizer, and the token text doesn't always match the document text at the token's span offsets (e.g. the token text is lowercase and may include prefixes such as ##).

The problem I run into is that Prodigy seems to reconstruct the text for annotation from the token text, rather than using the start/end offsets into the actual text in the jsonl file. Is there a way I can get Prodigy to display the correct text without modifying the jsonl file?

My recipe sets the disabled and whitespace fields, and works fine provided example['text'][start:end] == token['text']. But when I use the actual output from the tokenizer, the displayed text is lowercase, so Prodigy is clearly concatenating token['text'].

I guess I could replace token['text'] with example['text'][start:end] in the recipe and store the original tokenizer text in a temporary field, but this seems a little hacky?
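For reference, a minimal sketch of that workaround. It assumes the tokenizer already gives you (piece, start, end) triples with character offsets into the original text. The helper name build_prodigy_tokens and the extra "wordpiece" key are made up for illustration and are not part of Prodigy's token format.

```python
def build_prodigy_tokens(text, pieces):
    """Build Prodigy-style tokens from (piece_text, start, end) triples.

    The rendered "text" is sliced from the original document, while the
    raw tokenizer output is kept under a custom "wordpiece" key.
    """
    tokens = []
    for i, (piece, start, end) in enumerate(pieces):
        tokens.append({
            "text": text[start:end],  # original casing, no "##" prefix
            "start": start,
            "end": end,
            "id": i,
            # True if a whitespace character follows the token in the text
            "ws": text[end:end + 1].isspace(),
            "wordpiece": piece,  # hypothetical extra field, not a Prodigy key
        })
    return tokens


# Illustrative input: a camel-case label split into lowercase word pieces.
example_text = "roomTemp Sensor"
pieces = [("room", 0, 4), ("##temp", 4, 8), ("sensor", 9, 15)]
task = {
    "text": example_text,
    "tokens": build_prodigy_tokens(example_text, pieces),
}
```

With tokens built this way, concatenating each token's "text" plus a space wherever "ws" is true gives back the original text, and the word piece for each token is still available when exporting for the model.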

Hi! If the tokenizer doesn't preserve the original input, that's definitely unfortunate, and in that case I think it makes sense to store the output text and original text separately. So I think the best option would be to make the example "text" the final text (i.e. concatenated token texts) produced by the tokenizer, and then store the original text separately for reference. This way, you'd be annotating and working with exactly what the tokenizer outputs.

If you do want to annotate a text representation that's slightly different from the actual word piece tokens, you could also have a separate alignment step as a post-process. But that's potentially more of a hassle and really only worth it if the raw representations are too inconvenient to work with. Here's the alignment library that we use in spacy-transformers: explosion/tokenizations on GitHub, a robust and fast tokenization alignment library for Rust and Python.
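To give a rough idea of what such an alignment step does, here's a simplified pure-Python sketch. It is not the tokenizations library or its algorithm, just a greedy stand-in. It assumes the only changes the tokenizer makes are lowercasing and adding a "##" continuation prefix, that lowercasing doesn't change string length (true for ASCII), and that pieces appear in document order. The function name align_pieces is hypothetical.

```python
def align_pieces(text, pieces, prefix="##"):
    """Greedily recover (start, end) character offsets for each piece."""
    lowered = text.lower()  # assumes lowercasing keeps the length unchanged
    offsets = []
    cursor = 0
    for piece in pieces:
        # Drop the continuation prefix to get the surface form.
        surface = piece[len(prefix):] if piece.startswith(prefix) else piece
        # Search from the end of the previous piece, skipping whitespace.
        start = lowered.find(surface, cursor)
        if start == -1:
            raise ValueError(f"cannot align {piece!r} after offset {cursor}")
        end = start + len(surface)
        offsets.append((start, end))
        cursor = end
    return offsets


# Illustrative input only.
text = "AHU-1 supplyAirTemp"
pieces = ["ahu", "-", "1", "supply", "##air", "##temp"]
offsets = align_pieces(text, pieces)
original_slices = [text[s:e] for s, e in offsets]
```

A real aligner has to cope with normalisation that changes lengths, unknown tokens and dropped characters, which is where a dedicated library is worth it.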

I think I'd have to do it the other way around. My tokenizer splits on punctuation and camel case (my text is technical sensor network labels), then converts to lowercase, and finally runs the result through a statistical model (i.e. BPE or Unigram). If you concatenate the tokens, the result is next to impossible to read: there's no whitespace, which is why the labels use punctuation and camel case in the first place.

Ideally, if you passed Prodigy a token that consisted only of offsets and no actual text, it would use those offsets to extract the original text from the example's text field instead?

I'll look at the alignment library.

In the ner_manual UI, Prodigy will only use the "tokens". The spans you generate will be based on the start/end/ID information included in the respective tokens. But this also means that you can definitely use the tokens to customise the way they're rendered in the UI. You could also make them include additional information about their "true" offsets into the text, or how they map to the original wordpiece tokens produced by the tokenizer.
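As a rough sketch of the export side: once annotation is done, the spans can be turned into one label per token for the external model. This assumes each span carries "token_start" and "token_end" indices (with "token_end" inclusive) and that token ids match list positions. The function name spans_to_bio and the POINT label are invented for the example.

```python
def spans_to_bio(tokens, spans):
    """Convert annotated spans into one BIO label per token."""
    labels = ["O"] * len(tokens)
    for span in spans:
        # Assumed to be token indices, with token_end inclusive.
        first, last = span["token_start"], span["token_end"]
        labels[first] = "B-" + span["label"]
        for i in range(first + 1, last + 1):
            labels[i] = "I-" + span["label"]
    return labels


# Illustrative annotated example; the label name is made up.
annotated = {
    "text": "roomTemp Sensor",
    "tokens": [
        {"text": "room", "start": 0, "end": 4, "id": 0, "ws": False},
        {"text": "Temp", "start": 4, "end": 8, "id": 1, "ws": True},
        {"text": "Sensor", "start": 9, "end": 15, "id": 2, "ws": False},
    ],
    "spans": [
        {"start": 0, "end": 8, "token_start": 0, "token_end": 1, "label": "POINT"},
    ],
}
labels = spans_to_bio(annotated["tokens"], annotated["spans"])
```

Because the tokens are one-to-one with the word pieces, the labels line up with the model's inputs regardless of what text was shown in the UI.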