I recently started a new NER annotation project on org named entities, and I realised that the default English tokenizer fails rather frequently on my corpus. As a result, some important nuances become impossible to capture. There are too many cases for me to enumerate, so I started thinking about the following possibilities:
- Using Prodigy to annotate spans, then training a tokenizer on those spans to replace the default tokenizer when annotating/training the NER model;
- Somehow using a teach recipe together with the existing tokenizer in order to 'fix' the bad tokenizations.
As far as I can tell, it's not possible to run Prodigy without specifying some kind of tokenizer, so I wasn't sure whether this could be hacked by, say, providing a custom tokenizer with character-level tokenization, or some other approach.
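To make the character-level idea concrete, here's roughly what I had in mind (a minimal sketch; the `CharTokenizer` class and its space handling are my own invention, not anything built into spaCy or Prodigy):

```python
import spacy
from spacy.tokens import Doc

class CharTokenizer:
    """Hypothetical character-level tokenizer: one token per non-space character."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words, spaces = [], []
        for ch in text:
            if ch == " " and words:
                # Record the space as the preceding token's trailing-space flag
                spaces[-1] = True
            else:
                words.append(ch)
                spaces.append(False)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = CharTokenizer(nlp.vocab)

doc = nlp("Acme Corp.")
print([t.text for t in doc])  # one token per character
print(doc.text)               # round-trips back to "Acme Corp."
```

(This sketch drops leading or repeated spaces, so it wouldn't round-trip every input exactly; it's just meant to illustrate the shape of the workaround I was considering.)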
What would be the recommended approach in this situation?