Welcome to the forum @fernandorodriguespro,
In the `ner.manual` and `spans.manual` recipes you can add a toggle for switching between character-based and token-based annotation by calling them with the `--highlight-chars` flag. Be mindful, though, that this won't affect the tokenization, so you'll end up with tokens and spans that are misaligned. To train a model afterwards, you'll need a tokenizer that can recreate this same tokenization (see the important note in the docs referenced above).
The typical workflow in these cases is to use the `--highlight-chars` feature to "record" the tokenization issues in your dataset, and then use those examples to create custom tokenization rules for the tokenizer that will be used in training.
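For instance, in spaCy such a rule can be expressed as a tokenizer special case. This is just a minimal sketch: the chunk `foo(s)` and the way it's split are made-up stand-ins for whatever your character-level annotations revealed.

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical example: character-level annotation showed that the chunk
# "foo(s)" should really be four tokens. Special cases are checked before
# the prefix/suffix/infix rules, so this overrides the default behavior.
nlp.tokenizer.add_special_case(
    "foo(s)", [{"ORTH": "foo"}, {"ORTH": "("}, {"ORTH": "s"}, {"ORTH": ")"}]
)

doc = nlp("read foo(s) here")
print([t.text for t in doc])  # ['read', 'foo', '(', 's', ')', 'here']
```

The `ORTH` values of a special case must concatenate back to the original string, so the rule only changes how the text is segmented, never the text itself.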
You can also use the manually retokenized phrases to test the custom tokenizer you create.
From there, you can either integrate the custom tokenizer into your training pipeline or solve these issues in a preprocessing step (which is where the problem essentially belongs); in practice it shouldn't really matter which.