Tokenization compatibility issues in rel.manual

First off, thank you for the beautiful relation annotation tool -- we're really enjoying its UX and the efficiency with which we can label and visualize relations!

We have a transformer-based model that jointly predicts entities and relations. This model uses a byte-pair encoding (BPE) tokenization that is quite similar to what spaCy does, but not similar enough.

When making predictions, we create JSON similar to the rel.manual dataset output. It works well a lot of the time when loaded into Prodigy, but there are many cases where doc.char_span(start_index, stop_index) returns None -- e.g. when BPE splits in the middle of a spaCy token -- and front-end index errors when spaCy's tokenization in Prodigy doesn't split as much as the BPE tokenizer and our token offsets exceed the max position in the Prodigy spaCy Doc.
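To illustrate the first failure mode, here's a toy stand-in for spaCy's boundary check (not the real implementation, just a sketch of the behavior; the token offsets are hypothetical):

```python
def char_span(text, token_offsets, start, end):
    """Mimic spaCy's doc.char_span: only return the span text when
    start and end fall exactly on token boundaries. Hypothetical
    stand-in for illustration, not the real spaCy API."""
    starts = [s for s, e in token_offsets]
    ends = [e for s, e in token_offsets]
    if start in starts and end in ends:
        return text[start:end]
    return None

# "tokenization matters": spaCy-style tokens at (0, 12) and (13, 20)
offsets = [(0, 12), (13, 20)]
print(char_span("tokenization matters", offsets, 0, 5))   # None: BPE subword "token" ends mid-token
print(char_span("tokenization matters", offsets, 0, 12))  # "tokenization"
```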

To fix this I've tried using the same spaCy model (en_core_web_lg) as a pre-tokenizer for BPE, and using spacy.gold.align on normally spaCy-tokenized input and the BPE-encoded input to set the span start and stop on the transformer side.
This last approach works, but it's a lot of work to feed into Prodigy, and it causes errors when spaCy merges tokens that shouldn't be merged.

It's also worth noting that I initially supplied tokens, spans and relations from the BPE scheme, but these wouldn't load properly.

So first of all, this is a pain point. I don't think it makes sense to try to couple the tokenization between the predictive model and the Prodigy model: even using the same spaCy models, I'm getting differences in tokenization behavior, we may want to try different tokenization schemes on either end, and there isn't a straightforward way to ensure compatibility.

So I'm thinking the best way forward (because the other approaches don't really work) is to use the span character offset positions to snap to spaCy tokens on the Prodigy side. Is there a good function to get the spaCy token given a character index? Maybe such an approach could be integrated into Prodigy as a loading option to increase compatibility with non-spaCy-based approaches.
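The snapping idea can be sketched with a binary search over sorted token boundaries; this is a hypothetical helper, not a Prodigy or spaCy API:

```python
import bisect

def snap_to_token_bounds(token_starts, token_ends, start, end):
    """Snap arbitrary character offsets outward to the nearest token
    boundaries. token_starts/token_ends are the sorted per-token
    character offsets (hypothetical helper for illustration)."""
    # largest token start <= start
    i = bisect.bisect_right(token_starts, start) - 1
    # smallest token end >= end
    j = bisect.bisect_left(token_ends, end)
    return token_starts[max(i, 0)], token_ends[min(j, len(token_ends) - 1)]

# "Hello world" -> tokens "Hello" (0-5) and "world" (6-11);
# a span with offsets (2, 8) snaps outward to cover both tokens
print(snap_to_token_bounds([0, 6], [5, 11], 2, 8))  # (0, 11)
```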

This comes up a lot when training NER models from data where the tokenization is slightly different from spaCy's tokenization. Currently, unless you adjust your data, those instances are ignored during training. I'm not as familiar with Prodigy, but I would guess you see warnings and those instances aren't displayed?

I've thought about adding options to have doc.char_span either snap inside or outside to the nearest token boundary instead of returning None. For named entities there is the possibility you might end up with instances that are wrong (no longer a named entity, no longer the right type due to nesting), but there are lots of cases where it would be useful and better overall than discarding data.

There isn't currently a good function to get the token given a character offset. Even doc.char_span just loops through the tokens to see if one matches the start/end, which is kind of dumb.
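A fast character-to-token lookup is straightforward with a binary search over token start offsets; this is a hypothetical sketch, not an existing spaCy function:

```python
import bisect

def char_to_token(token_starts, char_idx):
    """Map a character offset to the index of the token containing it.
    token_starts is the sorted list of each token's start offset
    (hypothetical helper; not part of the spaCy API)."""
    return bisect.bisect_right(token_starts, char_idx) - 1

# "Hello world": tokens start at characters 0 and 6
print(char_to_token([0, 6], 7))  # 1 -> inside "world"
print(char_to_token([0, 6], 3))  # 0 -> inside "Hello"
```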

But to back up a step: what are your goals here with prodigy? Are you only doing manual annotation? Do you care whether the annotation produced by prodigy aligns with the original BPE tokens? Which BPE tokenizer are you using?

If you're happy to just annotate the BPE tokens and relations between them, and don't care so much about aligning the tokens to spaCy's linguistic tokenization, you could also just load in pre-tokenized text using your tokenizer. Here's an example using a word piece tokenizer for NER annotation that aligns with a transformer model: https://prodi.gy/docs/named-entity-recognition#transformers-tokenizers

You don't have to do it within the recipe – you could also use the logic as a preprocessing step. One of the key parts here is to set the "ws" key on the tokens, a boolean indicating whether the token is followed by whitespace. Prodigy will use this in the UI to render less whitespace and preserve readability. The relations UI will still draw borders around the tokens, so it might be a bit less pretty for subword tokens – but you'll have alignment.
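As a sketch of that preprocessing step, here's how you might build Prodigy-style token dicts from the (start, end) character offsets any tokenizer produces; the helper name and the sample offsets are hypothetical, but the "text"/"start"/"end"/"id"/"ws" keys follow Prodigy's documented token format:

```python
def make_prodigy_tokens(text, offsets):
    """Build Prodigy-style token dicts from (start, end) character
    offsets. The "ws" key tells Prodigy whether the token is followed
    by whitespace, so the UI can render it readably."""
    tokens = []
    for i, (start, end) in enumerate(offsets):
        # where the next token starts (or end of text for the last one)
        next_start = offsets[i + 1][0] if i + 1 < len(offsets) else len(text)
        tokens.append({
            "text": text[start:end],
            "start": start,
            "end": end,
            "id": i,
            "ws": text[end:next_start] != "",  # whitespace follows?
        })
    return tokens

# BPE-style subword split of "tokenize this"
print(make_prodigy_tokens("tokenize this", [(0, 5), (5, 8), (9, 13)]))
```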

(Also, thanks for the kind words, glad to hear you like the new relations features :blush:)

Hmmm, this is very close to what I was originally doing on the model side -- I was creating a token array with the ws attribute -- but it didn't seem to use my tokens array when loading the JSONLines back into Prodigy.

In case this helps anyone, here is the function I'm using to "snap" the subword tokens to spaCy tokens:

from collections import defaultdict

from spacy.gold import align
from spacy.tokens import Doc
from tokenizers import Encoding  # HuggingFace tokenizers


def bpe_token_to_spacy_characters(doc: Doc, encoding: Encoding):
    """Returns a lookup table of BPE token indexes to spaCy token character offsets

    Args:
        doc: the spaCy Doc for the text
        encoding: the HuggingFace Encoding of the same text

    Returns:
        token_to_spacy_word
        token_to_spacy_character -- snapped to the spaCy token

    """

    spacy_tokens = [token.text for token in doc]
    tokenizer_tokens = [e.replace("</w>", "") for e in encoding.tokens]

    spacy_words = []
    spacy_characters = []
    cost, spacy2tok, tok2spacy, spacy2tok_multi, tok2spacy_multi = align(
        spacy_tokens, tokenizer_tokens
    )

    # invert the one-to-many mapping so we can find all BPE tokens
    # that share a single spaCy token
    spacy2tok_inverse = defaultdict(list)
    for k, v in spacy2tok_multi.items():
        spacy2tok_inverse[v].append(k)

    for i in range(len(tokenizer_tokens)):
        spacy_word = tok2spacy[i]
        if spacy_word == -1:
            if i in tok2spacy_multi:
                spacy_token_offset = (tok2spacy_multi[i], tok2spacy_multi[i] + 1)
            else:
                multi_tokens = sorted(spacy2tok_inverse[i])
                spacy_token_offset = (multi_tokens[0], multi_tokens[-1] + 1)
        else:
            spacy_token_offset = (spacy_word, spacy_word + 1)
        # calculate character offsets (use a new name so we don't
        # shadow the spacy_tokens list used for alignment above)
        spacy_span = doc[spacy_token_offset[0] : spacy_token_offset[1]]
        min_char = spacy_span[0].idx
        end_char = spacy_span[-1].idx + len(spacy_span[-1])
        spacy_words.append(spacy_token_offset)
        spacy_characters.append((min_char, end_char))
        # check that the string can be extracted
        spacy_span_str = doc.char_span(min_char, end_char)
        assert spacy_span_str is not None

    return spacy_words, spacy_characters

It basically gives an offset map of BPE tokens to spaCy tokens and of BPE tokens to spaCy character offsets.

I'll have to circle back to this but for now we at least have in place the virtuous cycle of pre-annotation of relations :slight_smile:

OK, I see where I went wrong. I made a "tokens" array for my examples, similar to what is shown here: https://prodi.gy/docs/dependencies-relations#custom-model, but I was assuming that rel.manual would process the tokens array if present. It looks like that doesn't happen, but there is helpful pseudocode:

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens
import spacy

@prodigy.recipe("custom-dep")
def custom_dep_recipe(dataset, source):
    stream = JSONL(source)                          # load the data
    stream = add_relations_to_stream(stream)        # add custom relations
    stream = add_tokens(spacy.blank("en"), stream)  # add "tokens" to stream

    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # the incoming stream of examples
        "view_id": "relations",  # annotation interface to use
        "labels": ["ROOT", "nsubj", "amod", "dobj"]  # labels to annotate
    }

So it looks like the key is to call the add_tokens function on the stream.

Ahhh, you're right, sorry about the confusion! The add_tokens preprocessor does indeed respect pre-defined spans but the rel.manual recipe uses its own more complex logic to add tokens and translate spans and pattern matches.

We should definitely make it easier to supply custom tokenization – the easiest way to make this happen would probably be to add an option to create spaCy Doc objects given the "tokens" (which is no problem if they specify the "ws" information). Then we can use them for matching and make sure everything is consistent. I'll add this to my enhancement list for the next version :slightly_smiling_face:
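Constructing a Doc from pre-defined tokens with "ws" information can be sketched like this (assuming each token dict carries "text" and "ws" keys; the sample tokens are hypothetical):

```python
import spacy
from spacy.tokens import Doc

# Rebuild a spaCy Doc from a pre-tokenized example: the Doc
# constructor accepts the token texts plus a parallel list of
# booleans for trailing whitespace.
tokens = [
    {"text": "token", "ws": False},
    {"text": "ize", "ws": True},
    {"text": "this", "ws": False},
]
nlp = spacy.blank("en")
doc = Doc(
    nlp.vocab,
    words=[t["text"] for t in tokens],
    spaces=[t["ws"] for t in tokens],
)
print(doc.text)  # "tokenize this"
```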


It all makes sense now.

@adriane -- I do think that doc.char_span could have (at least) an option to snap outward to spaCy token boundaries. It would also be nice to have a fast char_to_token lookup.
The tokenizer I'm using is a FastBytePairEncoding from the :hugs: tokenizers library.

Ultimately I think I'll do something similar to the BERT wordpiece NER example in a custom recipe and precompute all the params.

Thanks for the support.