I recently started a new NER annotation project on organisation named entities, and I realised that the default English tokenizer fails rather frequently on my corpus. As a result, some important nuances become impossible to capture. There are too many cases for me to enumerate, so I started thinking about the following possibilities:
Using Prodigy to annotate spans in order to then train a tokenizer to replace the default tokenizer when annotating/training the NER model;
Using a teach recipe somehow with the existing tokenizer in order to 'fix' the bad tokenizations.
As far as I can tell, it's not possible to run Prodigy without specifying some kind of tokenizer, so I wasn't sure whether this could be hacked by, say, providing a custom tokenizer with character-level tokenization, or some other approach.
What would be the recommended approach in this situation?
By default, Prodigy will use the tokenizer of whichever spaCy model you load in. But you can also load in your own spaCy model package with a custom tokenizer, or start with pre-tokenized data with a "tokens" property. See here for an example of the data format.
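To make the pre-tokenized option a bit more concrete, here is a minimal sketch of building a task with a "tokens" property. The `pretokenize` helper and the whitespace regex are assumptions for illustration only (swap in whatever tokenizer suits your data), and the token fields ("text", "start", "end", "id") follow the data format as I understand it from the Prodigy docs, so double-check them against the linked example.

```python
import json
import re


def pretokenize(text, pattern=r"\S+"):
    """Build a task dict with a "tokens" list from a regex tokenizer.

    Hypothetical helper: the regex is a stand-in for your own tokenizer.
    """
    tokens = []
    for i, match in enumerate(re.finditer(pattern, text)):
        tokens.append({
            "text": match.group(),   # the token string
            "start": match.start(),  # character offset where the token starts
            "end": match.end(),      # character offset just past the token
            "id": i,                 # token index within the text
        })
    return {"text": text, "tokens": tokens}


task = pretokenize("Acme Corp. acquired Foo-Bar Ltd.")
# One JSON object per line (JSONL) is the usual input format.
print(json.dumps(task))
```

Each line written this way can go into a JSONL file and be loaded as the source for annotation, so the spaCy tokenizer's decisions are bypassed for those texts.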
So if you have a tokenizer that works well for your specific data, that'd be the easiest solution. If you don't, you could try customizing spaCy's tokenization rules, or add custom merging/splitting logic for very specific cases that can't be expressed with rules. (You could also use Prodigy to collect the special cases that are currently not tokenized the way you want them to be tokenized, for instance using the ner.manual recipe and maybe some custom CSS that adds a border around each token container. You can then highlight all tokens with the wrong tokenization, export the data and compile a list of special cases to focus on.)
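As a rough sketch of the "export the data and compile a list" step: the snippet below assumes the annotations were exported as JSONL, that each accepted example has "text", "answer" and "spans" with character offsets, and that the badly tokenized stretches were highlighted with a label called `BAD_TOK`. That label name and the `collect_special_cases` helper are made up for this example.

```python
import json
from collections import Counter


def collect_special_cases(jsonl_lines, label="BAD_TOK"):
    """Count the highlighted strings across accepted examples."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue  # skip rejected or ignored examples
        for span in eg.get("spans", []):
            if span.get("label") == label:
                counts[eg["text"][span["start"]:span["end"]]] += 1
    return counts


# Tiny in-memory stand-in for an exported JSONL file.
sample = [
    json.dumps({"text": "Shares of Yahoo! rose.", "answer": "accept",
                "spans": [{"start": 10, "end": 16, "label": "BAD_TOK"}]}),
    json.dumps({"text": "Yahoo! and E*TRADE fell.", "answer": "accept",
                "spans": [{"start": 0, "end": 6, "label": "BAD_TOK"},
                          {"start": 11, "end": 18, "label": "BAD_TOK"}]}),
    json.dumps({"text": "Nothing to see here.", "answer": "reject",
                "spans": [{"start": 0, "end": 7, "label": "BAD_TOK"}]}),
]

cases = collect_special_cases(sample)
for string, freq in cases.most_common():
    print(freq, string)
```

Sorting by frequency makes it easier to see which failures are worth a general rule and which are one-offs to hard-code.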
If you want to actually train a tokenizer, you could use an existing neural tokenizer implementation, train that and then use it to pre-tokenize your data before annotation. However, this will obviously be significantly slower at runtime.
Thanks for the reply. So, to be more specific, there is no way to 'teach' the default spaCy English tokenizer by using Prodigy to fix tokenizations or reject incorrect tokenizations?
For my data, the default spaCy English tokenizer is wrong maybe 20% of the time, so I don't want to spend time collecting hundreds or thousands of cases for rules. It would have been really convenient to be able to use the English tokenizer as a baseline model, and update it through Prodigy directly. After all, the tokenizer can play a big role in the final results of these types of pipelines, so it might be nice to allow the tokenizer to be customised/improved much like the NER models...
I would like to avoid having to go back to BRAT just to do the tokenization annotations.
Well, spaCy's tokenizer isn't "teachable" in that sense, because it's rule-based, not statistical. That's what makes it so fast and relatively easy to adjust for custom use cases, which is indeed very important.
The "rules" in this case aren't just lists of special cases and how to tokenize them – they describe how to split prefixes, suffixes and infixes and a few exceptions. You can read more about how that works here. You can easily add and remove prefix/infix/suffix rules or add special cases, but it wouldn't make sense to do this during annotation, because you want to evaluate changes on all of your data to make sure there are no unintended side-effects. So you typically want to collect the cases it currently fails on and work out how to best adjust the tokenization rules for the best accuracy tradeoff.
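For reference, adjusting the rules in code could look roughly like the sketch below. It uses a blank English pipeline so no model download is needed. The specific choices (keeping "Yahoo!" as one token, splitting on underscores) are purely illustrative assumptions and not recommendations for your corpus.

```python
import spacy
from spacy.symbols import ORTH
from spacy.util import compile_infix_regex

# Blank pipeline: just the language defaults and the rule-based tokenizer.
nlp = spacy.blank("en")

# 1) Special case: keep this exact string together as a single token.
nlp.tokenizer.add_special_case("Yahoo!", [{ORTH: "Yahoo!"}])

# 2) Infix rule: extend the default infix patterns with an extra one
#    (here: also split on underscores) and recompile the regex.
infixes = list(nlp.Defaults.infixes) + [r"_"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("Yahoo! bought foo_bar")
tokens = [t.text for t in doc]
print(tokens)
```

Removing a rule works the same way in reverse: filter the pattern you don't want out of the list before recompiling. Either way, you'd then re-run the modified tokenizer over all of your collected cases rather than judging it on a single sentence.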
(For instance, there might be special characters you never want to split on, and removing them from the infixes could give you a 10% increase in accuracy on those cases with a 1% regression on some other special cases, which is a good tradeoff overall. And then there might be cases you want to hard-code, because there's no consistent general-purpose policy for them.)
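One way to check that kind of tradeoff is to run the old and new rules over the same gold-tokenized examples and list what got fixed and what regressed. The sketch below is only an illustration: the two regex tokenizers are toy stand-ins for "before" and "after" rule sets, and the gold examples and the `evaluate`/`compare` helpers are invented for this purpose.

```python
import re


def evaluate(tokenize, gold):
    """Return the indices of examples whose tokenization matches gold exactly."""
    return {i for i, (text, tokens) in enumerate(gold) if tokenize(text) == tokens}


def compare(baseline, candidate, gold):
    """Summarise accuracy plus the examples fixed and regressed by the change."""
    base_ok = evaluate(baseline, gold)
    cand_ok = evaluate(candidate, gold)
    return {
        "baseline_acc": len(base_ok) / len(gold),
        "candidate_acc": len(cand_ok) / len(gold),
        "fixed": sorted(cand_ok - base_ok),
        "regressed": sorted(base_ok - cand_ok),
    }


def baseline(text):
    # Toy "before" rules: split on whitespace and around every hyphen.
    return re.findall(r"[^\s-]+|-", text)


def candidate(text):
    # Toy "after" rules: hyphen no longer treated as a split character.
    return re.findall(r"\S+", text)


# Hypothetical gold tokenizations, e.g. compiled from annotated data.
gold = [
    ("Coca-Cola rose", ["Coca-Cola", "rose"]),
    ("Rolls-Royce fell", ["Rolls-Royce", "fell"]),
    ("pre - tax profit", ["pre", "-", "tax", "profit"]),
    ("cost-cutting plan", ["cost", "-", "cutting", "plan"]),
]

report = compare(baseline, candidate, gold)
print(report)
```

Looking at the "regressed" list, and not only the overall accuracy, is what tells you which cases still need a hard-coded exception.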