Providing NER token spans only (no character offsets)

Hi!

I'm lazy (!) and providing character spans in the JSONL for NER annotation seems a little tedious. I've looked at the recipe but couldn't quite figure it out: is it possible to just provide token spans only?

For example:

{
    "text": "Hello Apple",
    "tokens": [
        { "text": "Hello", "id": 0 },
        { "text": "Apple", "id": 1 }
    ],
    "spans": [{ "label": "ORG", "token_start": 1, "token_end": 1 }]
}

Hi! This isn't supported out-of-the-box, because then Prodigy would have to be in charge of aligning the tokenization. This can get pretty tricky, so at the moment, this shouldn't be something that the library does silently under the hood.

If you want, you can write a small script using spaCy that constructs a new Doc object from the tokens and then gets the character offsets of the spans into doc.text by looking up each span at doc[start:end]. Here's an example:

from spacy.tokens import Doc
from spacy.lang.en import English
nlp = English()  # mostly need this for the vocab

examples = []
for eg in your_data_here:
    words = [token["text"] for token in eg["tokens"]]
    doc = Doc(nlp.vocab, words=words)
    spans = []
    for span in eg["spans"]:
        doc_span = doc[span["token_start"]:span["token_end"] + 1]
        spans.append({
            "start": doc_span.start_char,
            "end": doc_span.end_char,
            "label": span["label"]
        })
    # We don't need the tokens here anymore because we know
    # the text matches spaCy's tokenization
    examples.append({"text": doc.text, "spans": spans})
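
For example, running this over the sample from the question (stored in your_data_here) should give you character offsets like this:

your_data_here = [{
    "text": "Hello Apple",
    "tokens": [{"text": "Hello", "id": 0}, {"text": "Apple", "id": 1}],
    "spans": [{"label": "ORG", "token_start": 1, "token_end": 1}]
}]
# ... run the loop above ...
print(examples[0]["spans"])
# [{'start': 6, 'end': 11, 'label': 'ORG'}]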

However, there's one thing that this approach cannot solve for you: If you don't have character offsets and your tokenization doesn't preserve whitespace (like in your example), we have no way of knowing whether tokens are followed by a whitespace character or not and have to assume that they are. So the tokens "Hello", "Apple", "!" would become "Hello Apple !". If you have the whitespace information, you can pass in a list of boolean values as the spaces argument when constructing a Doc, e.g. spaces=[True, False, False].
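
For example, reusing the nlp object from the snippet above:

words = ["Hello", "Apple", "!"]
spaces = [True, False, False]  # "Hello" is followed by a space, "Apple" and "!" are not
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # "Hello Apple!"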

If you don't have the whitespace information, you can try to reconstruct it from the original text by matching it up with the token texts and checking for trailing whitespace characters. I had to do this for our spacy-stanfordnlp wrapper, and a simplified version of the idea (assuming you still have the original text to match the tokens against) looks something like this:
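
def align_tokens(text, token_texts):
    # For each token, find it in the original text and check whether it's
    # followed by a space (spaCy's spaces values only encode a single space)
    words = []
    spaces = []
    offset = 0
    for token_text in token_texts:
        start = text.find(token_text, offset)
        if start == -1:
            raise ValueError(f"Couldn't align token {token_text!r} with the text")
        offset = start + len(token_text)
        words.append(token_text)
        spaces.append(offset < len(text) and text[offset] == " ")
    return words, spaces

words, spaces = align_tokens("Hello Apple!", ["Hello", "Apple", "!"])
doc = Doc(nlp.vocab, words=words, spaces=spaces)
assert doc.text == "Hello Apple!"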

Finally, this approach is still going to fail if your tokenizer is destructive and doesn't preserve the original text. For instance, some tokenizers may output "I", "am" for the string "I'm". There are algorithms for handling this, but if your main objective is being lazy, I'm not sure you want to go down that rabbit hole :wink:

TL;DR: If your tokenization preserves whitespace information (whether each token is followed by a space or not) and is otherwise non-destructive, or if you don't care that much about preserving the original "text", use spaCy to auto-add the character offsets for you. Otherwise... it's more difficult.

Ok awesome, thanks for the incredibly detailed guidance! I will give this a shot and see how it goes.

In my case it's definitely simplified because all the inputs are already tokenized and any new inputs will also be tokenized. Using the spaCy code to get the character offsets seems easier than writing a new character counter, which is what I was hoping to avoid!
