Annotation task format for ner_manual interface

Ah, okay, that makes sense then. Are you just using mark? I think I misread your initial question and thought you were using the built-in ner.manual recipe, which does take care of the tokenization automatically.

If you need your own custom tokens that align with your entity spans, then you also need to provide them. It might be worth writing a little script to check how many of the spans do not align – maybe it’s just one or two that you can easily correct manually (or exclude from your data).

An easy way to do this is to use spaCy’s Doc.char_span method, which creates a token span from character offsets. If the character offsets don’t align to the tokens, it returns None. So you can do something like this:

import spacy

nlp = spacy.load("en_core_web_sm")  # or other model

for example in examples:  # your existing examples
    doc = nlp(example["text"])
    for span in example["spans"]:
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to tokens
            print("Misaligned tokens", example["text"], span)
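Once you know which spans line up, providing your own tokens could look roughly like the sketch below. It is only an illustration: the `tokenize` and `add_tokens` helpers and the regex tokenizer are my own assumptions and stand in for whatever tokenizer you actually use. The `tokens` entries (`text`, `start`, `end`, `id`) and the `token_start` / `token_end` keys on spans follow the ner_manual task format as I understand it, with `token_end` being the index of the last token in the span, so double-check against the docs for your version.

```python
import re


def tokenize(text):
    """Hypothetical stand-in tokenizer: words and single punctuation marks."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]


def add_tokens(example):
    """Attach tokens to a task and split spans into aligned and misaligned."""
    tokens = tokenize(example["text"])
    # Map character offsets to token ids for quick boundary lookups
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    aligned, misaligned = [], []
    for span in example.get("spans", []):
        if span["start"] in starts and span["end"] in ends:
            aligned.append({
                **span,
                "token_start": starts[span["start"]],
                "token_end": ends[span["end"]],  # assumed inclusive
            })
        else:
            misaligned.append(span)  # correct manually or exclude
    return {**example, "tokens": tokens, "spans": aligned}, misaligned


example = {
    "text": "Apple hires Tim Cook.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 20, "label": "PERSON"},
        {"start": 12, "end": 14, "label": "PERSON"},  # cuts "Tim" in half
    ],
}
task, misaligned = add_tokens(example)
```

Here `task` keeps the two spans that sit on token boundaries, and `misaligned` holds the one that ends in the middle of "Tim", so you can fix or drop it before loading the data.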