I’d like to highlight a custom entity called Location. I am having trouble using ner.manual to annotate certain sentences like “it should be44 W 5th St”. When I try to highlight ‘44 W 5th St’, it also picks up the ‘be’. As a result, I end up skipping this annotation because I don’t want the model to learn ‘be44 W 5th St’.
Is there anything in the works to address this issue?
The reason here is likely that the tokenizer you’re using doesn’t split the text in a way that produces the entity spans: "be44" remains one token, so no token "44" exists, and no entity span can be created for it.
The manual NER interface uses pre-tokenized text, to make it easier to highlight things (the selection can “snap” to the token boundaries), and also to make issues like this more obvious, and allow you to adjust the tokenization if needed. If you train a model with spaCy using annotations that don’t map to valid tokens, the model won’t be able to learn anything meaningful from them, because the tokenizer will never actually produce those tokens.
If you are using spaCy, one solution would be to adjust the tokenization rules. For example, you might want to consider adding a rule that always splits numbers following letters, if that’s common in your data. Or if this is just a single stray example, you can also just skip it.
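As a rough sketch of what such a rule could look like, assuming spaCy v3 and a blank English pipeline: the snippet below adds one extra infix pattern that splits wherever a letter is directly followed by a digit. The pattern itself is only an illustration, not a built-in rule, and it will also split strings like "B2B", so check it against your own data first.

```python
import spacy
from spacy.util import compile_infix_regex

# Blank English pipeline: only the tokenizer, no trained components needed.
nlp = spacy.blank("en")

# Illustrative extra rule (an assumption, not a spaCy default):
# a zero-width boundary between a letter and a following digit.
letter_then_digit = r"(?<=[A-Za-z])(?=[0-9])"

# Keep the default infix rules and append the new one.
infixes = list(nlp.Defaults.infixes) + [letter_then_digit]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("it should be44 W 5th St")
tokens = [token.text for token in doc]
print(tokens)
```

If "be" and "44" now show up as separate tokens, the selection in ner.manual can snap to "44" on its own. For this to take effect in Prodigy, the recipe would need to load a pipeline that uses the modified tokenizer, for example one you have saved to disk with nlp.to_disk.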
If you’re not using spaCy, you can also always provide your own tokenization via the "tokens" key in the data – see the “Annotation task formats” section in your PRODIGY_README.html for details.
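Here is a small sketch of building such a pre-tokenized task with plain Python. The make_task helper and its splitting rule (whitespace, plus a split wherever a letter is followed by a digit) are hypothetical stand-ins for whatever tokenizer you actually use, and the token fields shown ("text", "start", "end", "id") are what I'd expect the format to need, so double-check them against the README.

```python
import json
import re

# Hypothetical rule: split inside a chunk where a letter is followed by a digit.
BOUNDARY = re.compile(r"(?<=[A-Za-z])(?=[0-9])")


def make_task(text):
    """Build a task dict with character offsets for each token."""
    tokens = []
    for chunk in re.finditer(r"\S+", text):
        offset = chunk.start()
        for piece in BOUNDARY.split(chunk.group()):
            tokens.append({
                "text": piece,
                "start": offset,
                "end": offset + len(piece),
                "id": len(tokens),
            })
            offset += len(piece)
    return {"text": text, "tokens": tokens}


task = make_task("it should be44 W 5th St")
# One JSON object per line, as you would write it to a JSONL file.
print(json.dumps(task))
print([t["text"] for t in task["tokens"]])
# → ['it', 'should', 'be', '44', 'W', '5th', 'St']
```

Each token's "start" and "end" are character offsets into "text", so the interface can map a highlighted span back to the original string.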