Help with tokenizing numbers with units of measure

Hi! I just ran one of your examples through the default English tokenizer, and you’re right: "5.1V dc"
is currently split as ['5.1V', 'dc']. So it probably makes sense to customise the tokenizer and build your own set of rules specific to your domain, starting with suffix rules that split off alphabetic characters following numbers. This comment shows how you can extend the default expressions with your own and save out a custom model.
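Here’s a minimal sketch of what that could look like, assuming spaCy v2 and the small English model – the regex is only an illustration, so adjust it for the unit spellings in your data:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# Append a suffix rule that splits off a run of letters directly
# following a digit, e.g. "5.1V" -> "5.1" + "V". Tweak the regex
# if your units can contain digits or other characters.
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[0-9])[A-Za-z]+"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("5.1V dc")
print([token.text for token in doc])  # ['5.1', 'V', 'dc']

# The updated rules are saved out together with the rest of the pipeline
nlp.to_disk("/path/to/custom-model")
```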

The tokenizer rules are serialized with the model, and Prodigy supports loading spaCy models from packages, links and directories – so you can just pass in the path to your model when you run a recipe:

```bash
prodigy ner.teach your_dataset /path/to/custom-model ...
```

Btw, you might also want to experiment with different combinations of rule-based and statistical approaches. For example, there are only so many units, right? If so, you could use the matcher to find them in your text, and then check whether the previous token has a number-like value. This thread and this thread both have some more details and examples of this approach.
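For example, here’s a quick sketch of that idea, assuming spaCy v2.1+ (for the `IN` predicate) – the unit list is just a placeholder, so swap in whatever units occur in your data:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Placeholder unit list - replace with the units from your domain
units = ["v", "mv", "a", "ma", "hz", "khz", "ohm"]
matcher.add("UNIT", None, [{"LOWER": {"IN": units}}])

doc = nlp("The reading was 5.1 V dc at 50 Hz")
for match_id, start, end in matcher(doc):
    prev_token = doc[start - 1] if start > 0 else None
    # like_num is True for number-like tokens: "5.1", "50", "five"
    if prev_token is not None and prev_token.like_num:
        print("Measurement:", prev_token.text, doc[start].text)
```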

Alternatively, spaCy’s pre-trained models already have a category QUANTITY, which covers measurements like weights and distances, and which you could try to fine-tune on your data. This might be quicker and more efficient, because you don’t need to teach the model all of this from scratch. The solution you choose ultimately depends on your data, and it’s difficult to predict which one will work best. You might want to try out a few approaches and test them on the same evaluation set to find the one that produces the best results. (Prodigy can also help with that, since it lets you do quick binary evaluations.)
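If you want to go down that route, you could run something like this – the dataset name and JSONL file are placeholders for your own data:

```bash
prodigy ner.teach units_dataset en_core_web_sm /path/to/your_data.jsonl --label QUANTITY
```

As you accept or reject the suggestions, the model in the loop is updated and keeps asking you about the examples it’s most uncertain about.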