Hi! In general, spaCy is optimised around “real” text – e.g. sentences, paragraphs, real words. So you might find that you need to customise some of the tokenization rules to make sure your texts will actually be split into meaningful tokens. If you haven’t done this yet, I’d recommend running spaCy over some of your texts and checking whether the tokens match up with what you’re trying to label. For example, if the tokenizer produces "ABC-123" as one token, but your entity is "123", you won’t be able to train this effectively.
The ner.manual recipe streams in your text and lets you label the entity tokens by hand. That’s often the safest way to annotate new entity types from scratch, but it’s not always the most efficient. So if you’re able to express examples of the entities with abstract token patterns (e.g. the token shape or whether it’s a number), you could also experiment with ner.teach with patterns. This will pre-label examples that you can then accept or reject.
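Token patterns use the same syntax as spaCy’s `Matcher`, so you can prototype a pattern in spaCy directly before handing it to ner.teach. A minimal sketch – the `SERIAL` label and the example sentence are made up for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Abstract token pattern: any token that consists of digits.
# In a patterns file, the equivalent entry would look like:
# {"label": "SERIAL", "pattern": [{"IS_DIGIT": true}]}
matcher.add("SERIAL", [[{"IS_DIGIT": True}]])

doc = nlp("The serial number is 123 and the batch is 45.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

Testing patterns like this first helps you catch cases where the tokenizer splits your entities differently than the pattern expects.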
I actually just recorded a video the other day that discusses some of the trade-offs and how to decide which annotation mode to use:
You might also find @honnibal’s video on training a new entity type useful:
Finally, if the entities you’re trying to recognise are mostly combinations of letters and numbers, it might turn out that a rule-based approach with regular expressions or token patterns will always beat your statistical model in accuracy. So don’t be too disappointed if things don’t work out with the model. But I hope Prodigy can make it easier to experiment with the different approaches!
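For instance, a rule-based baseline for ID-like entities can be as small as a single regular expression. The pattern below is a made-up sketch for "ABC-123"-style codes, not something from Prodigy itself – you’d adjust it to whatever structure your entities actually have:

```python
import re

# Hypothetical pattern: 2-5 uppercase letters, a hyphen, then digits.
ID_PATTERN = re.compile(r"\b[A-Z]{2,5}-\d+\b")

def extract_ids(text):
    """Return all ID-like spans with their character offsets."""
    return [(m.group(), m.start(), m.end()) for m in ID_PATTERN.finditer(text)]

print(extract_ids("Parts ABC-123 and XY-4 were replaced."))
```

A baseline like this is also useful as a yardstick: if your trained model can’t beat it, the rules are probably the better production choice.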