Punctuation mark taken as decimal point by mistake (NER)

We would like to pre-annotate sentences for an NER task. However, the following JSON example fails to load:

{"text": "Um fast 15 Prozent abwärts ging es mit F5.", "spans": [{"start": 39, "end": 41, "label": "MISC"}]}

The example can only be loaded after changing 41 to 42, but the entity should be "F5" rather than "F5.". The reason seems to be the digit ("5") preceding the ".". Does anyone know how to fix this behaviour?

Could you share the command that you're using? I'm assuming you're using a command that loads a spaCy pipeline to handle the tokenisation.

I'm guessing right now that the tokenizer is a bit confused because it's assuming that 5. is referring to a decimal number. This could be fixed with a custom tokeniser, but I'd like to reproduce the issue locally first.

The command to launch Prodigy was...

prodigy ner.manual NER_Test blank:de f5.jsonl --label ORG,PER,LOC,MISC

with the example entry stored in f5.jsonl

Gotcha! Yeah, this seems to be an issue with the tokenizer. Here's what spaCy does under the hood.

import spacy

nlp = spacy.blank("de")

[t for t in nlp("Um fast 15 Prozent abwärts ging es mit F5.")]
# [Um, fast, 15, Prozent, abwärts, ging, es, mit, F5.]

It fails to split the "5" and the ".".
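Because "F5." stays a single token, the span 39-41 ends in the middle of a token, which is presumably why the example is rejected. A minimal sketch for checking this before loading data into Prodigy, assuming the same blank German pipeline; `find_misaligned_spans` is a hypothetical helper name, not part of spaCy or Prodigy. It relies on `Doc.char_span`, which returns `None` when the character offsets do not line up with token boundaries.

```python
import spacy


def find_misaligned_spans(nlp, example):
    """Return the spans whose offsets do not match token boundaries."""
    doc = nlp.make_doc(example["text"])
    misaligned = []
    for span in example["spans"]:
        # char_span returns None if start/end fall inside a token
        if doc.char_span(span["start"], span["end"]) is None:
            misaligned.append(span)
    return misaligned


nlp = spacy.blank("de")
example = {
    "text": "Um fast 15 Prozent abwärts ging es mit F5.",
    "spans": [{"start": 39, "end": 41, "label": "MISC"}],
}

# Any span listed here would need a tokenizer fix or corrected offsets
print(find_misaligned_spans(nlp, example))
```

Running this over the whole JSONL file would show which entries are affected before any tokenizer changes are made.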

Fix #1

There are a few ways to go about handling this. One method could be to create a tokenizer that contains a special case. I could adapt the guide found here to your case.

from spacy.symbols import ORTH

special_case = [{ORTH: "F5"}, {ORTH: "."}]
nlp.tokenizer.add_special_case("F5.", special_case)

[t for t in nlp("Um fast 15 Prozent abwärts ging es mit F5.")]
# [Um, fast, 15, Prozent, abwärts, ging, es, mit, F5, .]

I could save this tokenizer locally.

nlp.to_disk("de_special")

And now, I can refer to this stored nlp model.

prodigy ner.manual NER_Test de_special f5.jsonl --label ORG,PER,LOC,MISC

This seems to run without errors on my machine.
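If several known names end in a digit, the same special-case approach can be applied in a loop. This is only a sketch: the list of names is made up for illustration, and each name is only handled when it is directly followed by a period.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("de")

# Hypothetical list of entity names that end in a digit
names = ["F5", "A1", "Q3"]
for name in names:
    # Split "<name>." into the name and the trailing period
    nlp.tokenizer.add_special_case(name + ".", [{ORTH: name}, {ORTH: "."}])

text = "Um fast 15 Prozent abwärts ging es mit F5."
doc = nlp(text)
tokens = [t.text for t in doc]
print(tokens)
```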

Fix #2

Fix #1 works for this one instance, but I can imagine that it may fail again on another example. So instead it might be an option to write your own custom tokenizer. I'll copy the example from the docs below.

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("de")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
nlp.to_disk("de_custom")

The example, as-is, doesn't fix your issue. But it does demonstrate a class that you can change however you see fit. This should give you the most flexibility because you could implement your own code with custom regexes/rules. It would be more work, but it would give you the most "general" solution.
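As an illustration of what such a change could look like, here is a rough sketch of a regex-based variant. The class name `RegexTokenizer` and its rule (runs of word characters, or single punctuation characters) are assumptions made for this example, not something from the spaCy docs. It splits "F5." correctly, but it would also split decimals and abbreviations, and it assumes that tokens are separated by at most one plain space.

```python
import re

import spacy
from spacy.tokens import Doc


class RegexTokenizer:
    # Hypothetical rule: a run of word characters, or one punctuation character
    TOKEN_RE = re.compile(r"\w+|[^\w\s]")

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = []
        spaces = []
        for match in self.TOKEN_RE.finditer(text):
            words.append(match.group())
            # A token owns a trailing space if the next character is " "
            spaces.append(text[match.end():match.end() + 1] == " ")
        return Doc(self.vocab, words=words, spaces=spaces)


nlp = spacy.blank("de")
nlp.tokenizer = RegexTokenizer(nlp.vocab)

text = "Um fast 15 Prozent abwärts ging es mit F5."
doc = nlp(text)
tokens = [t.text for t in doc]
print(tokens)
```

To use a tokenizer like this from the Prodigy command line, it would likely also need to be serializable and registered so that the saved pipeline can recreate it. That part is not shown here.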


Many thanks, Vincent, for your detailed and helpful answer! Sounds good to apply Fix #1 immediately and start working on Fix #2.

A colleague of mine suggested that another alternative might also work for you. You might also customise the suffix_search of a Tokenizer, as explained here. This might allow you to only change a small part of a pre-existing tokeniser.
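A minimal sketch of that suffix approach, assuming the blank German pipeline from above. The two extra patterns are my own illustration: they split a trailing period only when it follows a letter plus one or two digits, so plain numbers followed by a period are left as they are. Python's `re` module needs fixed-width lookbehinds, hence one pattern per length.

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("de")

# Illustrative extra rules: period after a letter and one or two digits
extra_suffixes = [
    r"(?<=[A-Za-z][0-9])\.",
    r"(?<=[A-Za-z][0-9][0-9])\.",
]
suffixes = list(nlp.Defaults.suffixes) + extra_suffixes
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

text = "Um fast 15 Prozent abwärts ging es mit F5."
doc = nlp(text)
tokens = [t.text for t in doc]
print(tokens)
```

The modified pipeline can then be saved with `nlp.to_disk(...)` and passed to `ner.manual` in the same way as in Fix #1.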
