Providing NER token spans only (no character offsets)

Hi!

I'm lazy (!) and providing character spans in the JSONL for NER annotation seems a little tedious. I've looked at the recipe but couldn't quite figure it out: is it possible to just provide token spans only?

For example:

{
    "text": "Hello Apple",
    "tokens": [
        { "text": "Hello", "id": 0 },
        { "text": "Apple", "id": 1 }
    ],
    "spans": [{ "label": "ORG", "token_start": 1, "token_end": 1 }]
}

Hi! This isn't supported out-of-the-box, because then Prodigy would have to be in charge of aligning the tokenization. This can get pretty tricky, so at the moment, this shouldn't be something that the library does silently under the hood.

If you want, you can write a small script using spaCy that constructs a new Doc object from the tokens and then gets the character offsets of the spans into doc.text by looking up each span at doc[start:end]. Here's an example:

from spacy.tokens import Doc
from spacy.lang.en import English
nlp = English()  # mostly need this for the vocab

examples = []
for eg in your_data_here:
    words = [token["text"] for token in eg["tokens"]]
    doc = Doc(nlp.vocab, words=words)
    spans = []
    for span in eg["spans"]:
        doc_span = doc[span["token_start"]:span["token_end"] + 1]
        spans.append({
            "start": doc_span.start_char,
            "end": doc_span.end_char,
            "label": span["label"]
        })
    # We don't need the tokens here anymore because we know
    # the text matches spaCy's tokenization
    examples.append({"text": doc.text, "spans": spans})
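
For example, running this over the sample from the question (stored in your_data_here) should give you character offsets like this:

your_data_here = [{
    "text": "Hello Apple",
    "tokens": [{"text": "Hello", "id": 0}, {"text": "Apple", "id": 1}],
    "spans": [{"label": "ORG", "token_start": 1, "token_end": 1}]
}]
# ... run the loop above ...
print(examples[0]["spans"])
# [{'start': 6, 'end': 11, 'label': 'ORG'}]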

However, there's one thing that this approach cannot solve for you: If you don't have character offsets and your tokenization doesn't preserve whitespace (like in your example), we have no way of knowing whether tokens are followed by a whitespace character or not and have to assume that they are. So the tokens "Hello", "Apple", "!" would become "Hello Apple !". If you have the whitespace information, you can pass in a list of boolean values as the spaces argument when constructing a Doc, e.g. spaces=[True, False, False].
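
For example, reusing the nlp object from the snippet above:

words = ["Hello", "Apple", "!"]
spaces = [True, False, False]  # "Hello" is followed by a space, "Apple" and "!" are not
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)  # "Hello Apple!"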

If you don't have the whitespace information, you can try to reconstruct it from the original text by matching it up with the token texts and checking for trailing whitespace characters. I had to do this for our spacy-stanfordnlp wrapper, and a simplified version of the idea (assuming you still have the original text to match the tokens against) looks something like this:
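
def align_tokens(text, token_texts):
    # For each token, find it in the original text and check whether it's
    # followed by a space (spaCy's spaces values only encode a single space)
    words = []
    spaces = []
    offset = 0
    for token_text in token_texts:
        start = text.find(token_text, offset)
        if start == -1:
            raise ValueError(f"Couldn't align token {token_text!r} with the text")
        offset = start + len(token_text)
        words.append(token_text)
        spaces.append(offset < len(text) and text[offset] == " ")
    return words, spaces

words, spaces = align_tokens("Hello Apple!", ["Hello", "Apple", "!"])
doc = Doc(nlp.vocab, words=words, spaces=spaces)
assert doc.text == "Hello Apple!"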

Finally, this approach is still going to fail if your tokenizer is destructive and doesn't preserve the original text. For instance, some tokenizers may output "I", "am" for the string "I'm". There are algorithms for handling this, but if your main objective is being lazy, I'm not sure you want to go down that rabbit hole :wink:

TL;DR: If your tokenization preserves whitespace information (whether each token is followed by a space or not) and is otherwise non-destructive, or if you don't care that much about preserving the original "text", use spaCy to auto-add the character offsets for you. Otherwise... it's more difficult.

Ok awesome, thanks for the incredibly detailed guidance! I will give this a shot and see how it goes.

In my case it's definitely simplified because all the inputs are already tokenized and any new inputs will also be tokenized. Using the spaCy code to get the character offsets seems easier than writing a new character counter, which is what I was hoping to avoid!
