Gotcha! Yeah, this seems to be an issue with the tokenizer. Here's what spaCy does under the hood.
import spacy
nlp = spacy.blank("de")
[t for t in nlp("Um fast 15 Prozent abwärts ging es mit F5.")]
# [Um, fast, 15, Prozent, abwärts, ging, es, mit, F5.]
It fails to split "F5." into "F5" and ".".
Fix #1
There are a few ways to go about handling this. One method is to add a special case to the tokenizer. I could adapt the guide found here to your case.
from spacy.symbols import ORTH
# Tell the tokenizer to always split "F5." into two tokens: "F5" and "."
special_case = [{ORTH: "F5"}, {ORTH: "."}]
nlp.tokenizer.add_special_case("F5.", special_case)
[t for t in nlp("Um fast 15 Prozent abwärts ging es mit F5.")]
# [Um, fast, 15, Prozent, abwärts, ging, es, mit, F5, .]
I could save this pipeline, with its updated tokenizer, locally.
nlp.to_disk("de_special")
And now, I can refer to this stored nlp model.
prodigy ner.manual NER_Test de_special f5.jsonl --label ORG,PER,LOC,MISC
This seems to run without errors on my machine.
Fix #2
This would work for this one instance, but I can imagine that it may fail again on another example. So instead, it might be an option to write your own custom tokenizer. I'll copy the example from the docs below.
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        # Avoid zero-length tokens
        for i, word in enumerate(words):
            if word == "":
                words[i] = " "
                spaces[i] = False
        # Remove the final trailing space
        if words[-1] == " ":
            words = words[0:-1]
            spaces = spaces[0:-1]
        else:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)

    # nlp.to_disk() below also serializes the tokenizer, so give it the
    # hooks it expects; this class carries no state, so they can be no-ops
    def to_disk(self, path, **kwargs):
        pass

    def from_disk(self, path, **kwargs):
        return self

nlp = spacy.blank("de")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
nlp.to_disk("de_custom")
The example, as-is, doesn't fix your issue. But it does demonstrate a class that you can change however you see fit. This should give you the most flexibility, because you could implement your own regexes/rules inside __call__. It would be more work, but it would give you the most "general" solution. One caveat: for spacy.load (and hence Prodigy) to restore a custom tokenizer class later, it also needs to be registered via @spacy.registry.tokenizers and referenced in the pipeline config, which the spaCy docs on custom tokenizers explain.
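There's also a middle ground worth mentioning: keep the default German tokenizer and just extend its suffix rules so that a trailing "." after a digit gets split off. Here's a minimal sketch; the extra regex is my own assumption about your data, and note that it would also split German ordinals like "15.", which may or may not be what you want.
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("de")
# Extra suffix rule (assumption!): also split a final "." that follows a digit
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=\d)\."]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
[t for t in nlp("Um fast 15 Prozent abwärts ging es mit F5.")]
# [Um, fast, 15, Prozent, abwärts, ging, es, mit, F5, .]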