For my use case I need to preserve the hyphen rather than splitting on it.
If we initialize the Tokenizer with non-default infix, prefix, and suffix rules, then it will not split on "-" and the other symbols where spaCy would normally split — correct?
I have used this approach in the recent past:

```python
t = Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
              suffix_search=suffix_re.search,
              infix_finditer=infix_re.finditer)
```

with some punctuation symbols taken out of those regexes (prefix_re, infix_re, suffix_re). That way I was able to keep tokens like "post-secondary" and "co-op" together.
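Concretely, what I have in mind is something like this — a minimal sketch assuming spaCy v2.2+/v3, where the infix list mirrors spaCy's default English patterns with the letter-hyphen rule left out (the sample sentence is just an illustration):

```python
import spacy
from spacy.lang.char_classes import (
    ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES,
    LIST_ELLIPSES, LIST_ICONS,
)
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Rebuild the default English infix patterns, omitting the rule that
# splits on hyphens between letters.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # Default hyphen rule, intentionally left out:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("post-secondary co-op")])
# ['post-secondary', 'co-op']
```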
I just purchased Prodigy, so I have not tried the above there yet. But it should work — right? — as long as I place the custom tokenizer in the processing pipeline properly.
Thanks in advance.