Help with tokenizing numbers with units of measure

Hi! I just ran one of your examples through the default English tokenizer, and you’re right: "5.1V dc"
is currently split as ['5.1V', 'dc']. So it probably makes sense to customise the tokenizer and build your own set of rules specific to your domain, starting with suffix rules that split off alphabetic characters following numbers. This comment shows how you can extend the default expressions with your own and save out a custom model.
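Here’s a minimal sketch of what that could look like, assuming spaCy v2 and the small English model – the regex is only an illustration, so adjust it for the unit spellings in your data:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# Append a suffix rule that splits off a run of letters directly
# following a digit, e.g. "5.1V" -> "5.1" + "V". Tweak the regex
# if your units can contain digits or other characters.
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[0-9])[A-Za-z]+"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("5.1V dc")
print([token.text for token in doc])  # ['5.1', 'V', 'dc']

# The updated rules are saved out together with the rest of the pipeline
nlp.to_disk("/path/to/custom-model")
```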

The tokenizer rules are serialized with the model, and Prodigy supports loading spaCy models from packages, links and directories – so you can just pass in the path to your model when you run a recipe:

```bash
prodigy ner.teach your_dataset /path/to/custom-model ...
```

Btw, you might also want to experiment with different combinations of rule-based and statistical approaches. For example, there are only so many units, right? If so, you could use the matcher to find them in your text, and then check whether the previous token has a number-like value. This thread and this thread both have some more details and examples of this approach.
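For example, here’s a quick sketch of that idea, assuming spaCy v2.1+ (for the `IN` predicate) – the unit list is just a placeholder, so swap in whatever units occur in your data:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Placeholder unit list - replace with the units from your domain
units = ["v", "mv", "a", "ma", "hz", "khz", "ohm"]
matcher.add("UNIT", None, [{"LOWER": {"IN": units}}])

doc = nlp("The reading was 5.1 V dc at 50 Hz")
for match_id, start, end in matcher(doc):
    prev_token = doc[start - 1] if start > 0 else None
    # like_num is True for number-like tokens: "5.1", "50", "five"
    if prev_token is not None and prev_token.like_num:
        print("Measurement:", prev_token.text, doc[start].text)
```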

Alternatively, spaCy’s pre-trained models already have a category QUANTITY, which covers measurements like weights and distances, and which you could try to fine-tune on your data. This might be quicker and more efficient, because you don’t need to teach the model all of this from scratch. The solution you choose ultimately depends on your data, and it’s difficult to predict which one will work best. You might want to try out a few approaches and test them on the same evaluation set to find the one that produces the best results. (Prodigy can also help with that, since it lets you do quick binary evaluations.)
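If you want to go down that route, you could run something like this – the dataset name and JSONL file are placeholders for your own data:

```bash
prodigy ner.teach units_dataset en_core_web_sm /path/to/your_data.jsonl --label QUANTITY
```

As you accept or reject the suggestions, the model in the loop is updated and keeps asking you about the examples it’s most uncertain about.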