There are two problems with this approach. First, if you didn't do character-based highlighting the first time, you'd need to redo your annotations.
The second problem is worse: even if you do character-based annotations, NER models are trained on token-based tags, not character-based ones. Here are more details:
But as the docs show, this capability is really intended for languages (e.g., Chinese) where characters represent tokens, not for token-based languages like English when training NER models.
Also, the same post explains why the --highlight-chars flag isn't available for ner.correct or ner.teach:
Therefore, I would not recommend this approach.
Option 2: Create a modified tokenizer
I would recommend modifying your tokenizer so that you keep your annotations as tokens, but the tokenizer splits text the way you want. This will require a little knowledge of spaCy's tokenizer, but there is documentation. What I would recommend is: find a handful of examples where you've noticed the current tokenizer doesn't behave as you'd expect. Then create a modified tokenizer that handles them to your liking. Save that tokenizer, and then use it for all parts of your labeling workflow: ner.manual, ner.correct, etc.
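Here's a minimal sketch of what that could look like. The infix pattern below (splitting on a comma between two digits, as in "unit 1,32") and the save path are illustrative assumptions; adapt them to the cases you actually found:

```python
# Sketch: extend spaCy's default infix rules so the tokenizer splits on a
# comma that sits between two digits. The extra pattern and the save path
# are assumptions for illustration -- adapt them to your own examples.
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Treat a comma between two digits as an infix split point.
infixes = nlp.Defaults.infixes + [r"(?<=[0-9]),(?=[0-9])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("unit 1,32 fake road")])

# Save the pipeline so the exact same tokenizer can be loaded in every
# recipe (ner.manual, ner.correct, training, ...).
nlp.to_disk("./custom_tokenizer_model")
```

You'd then pass this saved pipeline as the model argument to your Prodigy recipes so annotation and training all go through the same tokenizer.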
The downside is you'll need to redo your initial annotations. While you may be tempted to reuse your original annotations made with the default tokenizer, you'll likely run into a problem, as a small percentage of annotations will be tokenized inconsistently (e.g., ner.manual used the default tokenizer while your ner.correct uses your modified tokenizer). Alternatively, if you're good with Python, you could identify which annotations the two tokenizers treat differently, relabel only those in ner.manual, then proceed.
The key message is: keep one and only one tokenizer throughout your entire annotation/training process. This is echoed in the docs:
When using character-based highlighting, annotation may be slower and there’s no guarantee that the spans you annotate map to actual tokens later on. If your goal is to train a named entity recognizer, you should consider using the same tokenizer during annotation, to make sure that your data can be used.
Option 3: Pre-process/clean data and still use default tokenizer
The other option is to pre-process the text, e.g., add in white space manually, to "trick" spaCy's default tokenizer into performing as you'd want. For example, "unit 1,32 fake road" -> "unit 1, 32 fake road". I would caution against this, as ideally you embed pre-processing steps into spaCy's pipeline (e.g., through the tokenizer) so you're always feeding raw data into your spaCy pipeline.
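For completeness, the pre-processing idea could look like the sketch below (the regex is an illustrative assumption matching the example above), though again Option 2 is the safer route:

```python
# Sketch of the pre-processing workaround: insert a space after a comma
# that sits between two digits, before the text reaches the tokenizer.
import re

def add_space_after_comma(text: str) -> str:
    """Turn e.g. 'unit 1,32 fake road' into 'unit 1, 32 fake road'."""
    return re.sub(r"(?<=[0-9]),(?=[0-9])", ", ", text)

print(add_space_after_comma("unit 1,32 fake road"))  # unit 1, 32 fake road
```

Note that this changes the raw text itself, so character offsets in any existing annotations would no longer line up, which is another reason to prefer the tokenizer-based approach.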
Let me know if this helps or if you have other questions!