Punctuation mark taken as decimal point by mistake (NER)

TJC · May 16, 2022, 10:53am

We would like to pre-annotate sentences for an NER task. However, the following JSON example fails to load:

{"text": "Um fast 15 Prozent abwärts ging es mit F5.", "spans": [{"start": 39, "end": 41, "label": "MISC"}]}

The example only loads after changing 41 to 42, but the entity should be "F5" rather than "F5.". The cause seems to be the digit ("5") preceding the ".". Does anyone know how to fix this behaviour?
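As a quick sanity check (a plain-Python sketch, not part of Prodigy), slicing the text with the offsets from the JSON suggests that 39 to 41 does cover exactly "F5" at the character level, so the offsets themselves look right:

```python
import json

# The example line from f5.jsonl, copied from above.
line = '{"text": "Um fast 15 Prozent abwärts ging es mit F5.", "spans": [{"start": 39, "end": 41, "label": "MISC"}]}'
example = json.loads(line)

text = example["text"]
span = example["spans"][0]

# Character offsets are end-exclusive, like Python slices.
entity_text = text[span["start"]:span["end"]]
print(entity_text)                           # F5
print(text[span["start"]:span["end"] + 1])   # F5.
```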

koaning · May 16, 2022, 10:57am

Could you share the command that you're using? I'm assuming you're using a command that loads a spaCy pipeline to handle the tokenisation.

I'm guessing right now that the tokenizer is a bit confused because it's assuming that 5. is referring to a decimal number. This could be fixed with a custom tokeniser, but I'd like to reproduce the issue locally first.

TJC · May 16, 2022, 11:12am

The command to launch prodigy was...

prodigy ner.manual NER_Test blank:de f5.jsonl --label ORG,PER,LOC,MISC

with the example entry stored in f5.jsonl

koaning · May 16, 2022, 4:12pm

Gotcha! Yeah, this seems to be an issue with the tokenizer. Here's what spaCy does under the hood.

import spacy

nlp = spacy.blank("de")

[t for t in nlp("Um fast 15 Prozent abwärts ging es mit F5.")]
# [Um, fast, 15, Prozent, abwärts, ging, es, mit, F5.]

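It fails to split the "5" from the ".", so the whole of "F5." ends up as a single token and a span that ends at character 41 stops in the middle of it.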
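To make the mismatch concrete, here is a minimal sketch in plain Python (no spaCy needed) that takes the token texts printed above and checks whether a character span starts and ends on token boundaries. The helper names token_offsets and is_aligned are made up for illustration; this is not how Prodigy implements its check, just the same idea.

```python
text = "Um fast 15 Prozent abwärts ging es mit F5."
# Token texts as printed by the blank German pipeline above.
tokens = ["Um", "fast", "15", "Prozent", "abwärts", "ging", "es", "mit", "F5."]

def token_offsets(text, tokens):
    """Return (start, end) character offsets for each token, in order."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets

def is_aligned(span_start, span_end, offsets):
    """A span is aligned if it starts at a token start and ends at a token end."""
    starts = {start for start, _ in offsets}
    ends = {end for _, end in offsets}
    return span_start in starts and span_end in ends

offsets = token_offsets(text, tokens)
print(offsets[-1])                   # (39, 42)
print(is_aligned(39, 41, offsets))   # False
print(is_aligned(39, 42, offsets))   # True
```

The last token runs from 39 to 42, which is why only the span ending at 42 is accepted.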

Fix #1

There are a few ways to handle this. One is to add a special case to the tokenizer. I adapted the guide found here to your case.

from spacy.symbols import ORTH

special_case = [{ORTH: "F5"}, {ORTH: "."}]
nlp.tokenizer.add_special_case("F5.", special_case)

[t for t in nlp("Um fast 15 Prozent abwärts ging es mit F5.")]
# [Um, fast, 15, Prozent, abwärts, ging, es, mit, F5, .]

I can then save this pipeline, including the adapted tokenizer, locally.

nlp.to_disk("de_special")

And now, I can refer to this stored nlp model.

prodigy ner.manual NER_Test de_special f5.jsonl --label ORG,PER,LOC,MISC

This seems to run without errors on my machine.

Fix #2

Fix #1 works for this one instance, but I can imagine it failing again on another example. So instead, it might be an option to write your own custom tokenizer. I'll copy the example from the docs below.

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False

        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("de")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
nlp.to_disk("de_custom")

The example, as-is, doesn't fix your issue. But it does demonstrate a class that you can change however you see fit. This should give you the most flexibility because you could implement your own code with custom regexes/rules. It would be more work, but it would give you the most "general" solution.
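As one possible starting point for such a rule, here is a rough sketch of the splitting logic in plain Python: a regex that separates runs of word characters from single punctuation marks and records whether each token is followed by whitespace. The function name split_words_and_spaces is my own invention; the two returned lists are meant to be passed as words and spaces to Doc inside __call__. It is deliberately naive: it would also split a decimal like "3.5" into three tokens, and it does not preserve repeated or leading whitespace, so treat it as a sketch under those assumptions rather than a finished tokenizer.

```python
import re

# A run of word characters, or a single non-word, non-space character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def split_words_and_spaces(text):
    words, spaces = [], []
    for match in TOKEN_PATTERN.finditer(text):
        words.append(match.group())
        # True if the character right after the token is whitespace.
        spaces.append(text[match.end():match.end() + 1].isspace())
    return words, spaces

words, spaces = split_words_and_spaces("Um fast 15 Prozent abwärts ging es mit F5.")
print(words[-2:])   # ['F5', '.']
print(spaces[-2:])  # [False, False]
```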

TJC · May 18, 2022, 12:06pm

Many thanks, Vincent, for your detailed and helpful answer! Sounds good to apply Fix #1 immediately and start working on Fix #2.

koaning · May 20, 2022, 12:04pm

A colleague of mine suggested another alternative that might work for you: you can customise the suffix_search of a Tokenizer, as explained here. This lets you change only a small part of a pre-existing tokeniser.
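To give a feel for what a suffix rule does, here is a toy model in plain Python. It is not spaCy's implementation, and the rule list is a hypothetical stand-in rather than the real German defaults: it only assumes that the default rules split a "." after a lowercase letter but not after a digit. The names compile_suffixes and split_suffixes are invented for this sketch.

```python
import re

def compile_suffixes(pieces):
    # Anchor every rule at the end of the string and OR them together.
    return re.compile("|".join(piece + "$" for piece in pieces))

def split_suffixes(token, suffix_search):
    """Repeatedly peel matching suffixes off the end of a token."""
    tail = []
    while token:
        match = suffix_search(token)
        if match is None:
            break
        tail.insert(0, token[match.start():])
        token = token[:match.start()]
    return ([token] if token else []) + tail

# Hypothetical stand-in rules: "?" and "!" always split, "." only after a lowercase letter.
base_rules = [r"\?", r"!", r"(?<=[a-zäöüß])\."]
base = compile_suffixes(base_rules)
# Same rules plus one that always splits a trailing ".".
extended = compile_suffixes(base_rules + [r"\."])

print(split_suffixes("Haus.", base.search))     # ['Haus', '.']
print(split_suffixes("F5.", base.search))       # ['F5.']
print(split_suffixes("F5.", extended.search))   # ['F5', '.']
```

With the extra rule in place, the period comes off no matter which character precedes it.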

TJC · September 14, 2022, 5:06pm

For the sake of completeness: The following code did the trick (Tokenizer customisation, as suggested above):

import spacy

nlp_blank_de = spacy.blank("de")

# Start from the default German suffix rules and add one that always
# splits a trailing "." off a token.
nlp_blank_de_suffixes = list(nlp_blank_de.Defaults.suffixes) + [r"\.$"]

# Compile the rules and plug them into the existing tokenizer.
nlp_blank_de_regex = spacy.util.compile_suffix_regex(nlp_blank_de_suffixes)
nlp_blank_de.tokenizer.suffix_search = nlp_blank_de_regex.search

nlp_blank_de.to_disk("de_punctuation")
