Annotation task format for ner_manual interface

ines · May 9, 2019, 1:41pm

Ah, okay, that makes sense then. Are you just using mark? I think I misread your initial question and thought you were using the built-in ner.manual recipe, which does take care of the tokenization automatically.

If you need your own custom tokens that align with your entity spans, then you also need to provide them. It might be worth writing a little script to check how many of the spans do not align – maybe it’s just one or two that you can easily correct manually (or exclude from your data).

An easy way to do this is to use spaCy’s Doc.char_span method, which creates a token span from character offsets. If the character offsets don’t align to the tokens, it returns None. So you can do something like this:

import spacy

nlp = spacy.load("en_core_web_sm")  # or other model

for example in examples:  # your existing examples
    doc = nlp(example["text"])
    for span in example["spans"]:
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to tokens
            print("Misaligned tokens", example["text"], span)
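
If you do end up supplying your own tokens, here is a rough sketch of how the task could be built. It is only an illustration: `tokenize` is a hypothetical regex stand-in for whatever tokenizer you actually use, and the key names (`tokens` with `text`/`start`/`end`/`id`, spans with `token_start`/`token_end`) are my assumption about what the manual NER interface expects, so please check them against the docs for your Prodigy version. Spans whose character offsets don't line up with token boundaries are returned separately so you can fix or drop them.

```python
import re


def tokenize(text):
    """Hypothetical stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]


def add_tokens(example):
    """Return (task, misaligned): the example with tokens and token-aligned spans,
    plus the spans whose character offsets do not match token boundaries."""
    tokens = tokenize(example["text"])
    starts = {t["start"]: t["id"] for t in tokens}  # char offset -> id of token starting there
    ends = {t["end"]: t["id"] for t in tokens}      # char offset -> id of token ending there
    aligned, misaligned = [], []
    for span in example.get("spans", []):
        if span["start"] in starts and span["end"] in ends:
            aligned.append({
                **span,
                "token_start": starts[span["start"]],
                "token_end": ends[span["end"]],  # assumed inclusive: id of the last token
            })
        else:
            misaligned.append(span)
    return {**example, "tokens": tokens, "spans": aligned}, misaligned


example = {
    "text": "Apple hired Tim Cook.",
    "spans": [
        {"start": 12, "end": 20, "label": "PERSON"},  # "Tim Cook", aligns with tokens
        {"start": 0, "end": 4, "label": "ORG"},       # "Appl", cuts a token in half
    ],
}
task, skipped = add_tokens(example)
print(task["spans"])
print("Misaligned:", skipped)
```

If your tokens come from spaCy instead, the same offsets are available as token.idx and token.idx + len(token), so the stand-in tokenizer can be swapped out without changing the rest.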
