Help with tokenizing numbers with units of measure

I have measurement values that I’ll be “learning” from. I’m labelling the Number of units and the Unit of measure.

5.1V dc and 5.1 V dc should be labelled as:

  • Number of units = 5.1
  • Unit of measure = V dc

0.1mV ac and 0.1 mV ac should be labelled as:

  • Number of units = 0.1
  • Unit of measure = mV ac

As you can see, there may or may not be whitespace between the Number of units and the Unit of measure.

Could you please point me in the right direction?

Would I need a custom tokenizer to split them up? If so, any thoughts on how best to do that?

Hi! I just ran one of your examples through the default English tokenizer and you’re right: "5.1V dc" is currently split as ['5.1V', 'dc']. So it probably makes sense to customise the tokenizer and build your own set of rules specific to your domain, starting with suffix rules that split off alphabetic characters following numbers. This comment shows how you can extend the default expressions with your own and save out a custom model.
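
Here’s a minimal sketch of that suffix-rule idea – the pattern is just a starting point, so you’ll likely want to adjust it for your data:

import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load('en')

# Add a suffix rule that splits off letters trailing a digit,
# so "5.1V" becomes "5.1" + "V"
suffixes = nlp.Defaults.suffixes + (r'(?<=[0-9])[a-zA-Z]+',)
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

print([t.text for t in nlp(u'5.1V dc')])  # ['5.1', 'V', 'dc']
nlp.to_disk('/path/to/custom-model')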

The tokenizer rules are serialized with the model, and Prodigy supports loading in spaCy models from packages, links and directories – so you can just pass in the path of your model when you run a recipe:

prodigy ner.teach your_dataset /path/to/custom-model ...

Btw, you might also want to experiment with different combinations of rule-based and statistical approaches. For example, there are only so many units, right? If so, you could use the matcher to find them in your text, and then check whether the previous token has a number-like value. This thread and this thread both have some more details and examples of this approach.
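
A rough sketch of that matcher idea – the unit list here is purely hypothetical, so swap in whatever units actually occur in your data:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)

# Hypothetical unit vocabulary – extend for your domain
for unit in ['V', 'mV', 'A', 'mA']:
    matcher.add('UNIT', None, [{'ORTH': unit}])

doc = nlp(u'The output is 5.1 V dc')
for match_id, start, end in matcher(doc):
    # Check whether the token before the unit looks like a number
    if start > 0 and doc[start - 1].like_num:
        print(doc[start - 1].text, doc[start:end].text)  # 5.1 V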

Alternatively, spaCy’s pre-trained models already have a QUANTITY category, which you could try to fine-tune on your data. This might be quicker and more efficient, because you don’t need to teach the model all of this from scratch. Which solution you choose ultimately depends on your data, and it’s difficult to predict which one will work best. You might want to try out a few approaches and test them on the same evaluation set to find the one that produces the best results. (Prodigy can also help with that, since it lets you do quick binary evaluations.)
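
For example, something along these lines (the dataset name here is hypothetical):

prodigy ner.teach volts_quantities en_core_web_sm volts.jsonl --label QUANTITY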

Hi ines,

Thanks for the reply. I was able to get the tokenization working with your guidance. I saved it as a custom model and then used Prodigy to start some labelling. However, Prodigy wasn’t using the tokenizer that I created. I’m sure I’m missing something obvious.

Here’s what I did:

The code:

import re

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en')

# Split off leading signed/decimal numbers as a prefix
prefix_re = re.compile(r'''^[±\-\+0-9., ]+[0-9 ]+''')
# Split tokens on slashes, e.g. "mV/km" -> "mV", "/", "km"
infix_re = re.compile(r'''[/]''')

def my_tokenizer_pri(nlp):
    # Reuse the default suffix rules and token_match,
    # overriding only the prefix and infix behaviour
    return Tokenizer(nlp.vocab, {},
                     prefix_search=prefix_re.search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match)

nlp.tokenizer = my_tokenizer_pri(nlp)

nlp.to_disk('./models/en_custom')

print([w.text for w in nlp(u'0.1mV/km ac')])
print([w.text for w in nlp(u'3 → 12V')])

Output:

['0.1', 'mV', '/', 'km', 'ac']
['3', '→', '12', 'V']

I can also confirm that the saved model loads and functions as expected:

nlp = spacy.load('./models/en_custom')
print([w.text for w in nlp(u'0.1mV/km ac')])
print([w.text for w in nlp(u'3 → 12V')])

Output:

['0.1', 'mV', '/', 'km', 'ac']
['3', '→', '12', 'V']

Then I ran the following:

prodigy ner.manual volts ./models/en_custom volts.jsonl --label QUANTITY,UOM

But… I was presented with this:
[screenshot of the annotation UI, where "1mV ac" is not split as expected]

My tokenizer (from the code above) would have split this as: 1, mV, ac.

What am I missing?

Thanks

Thanks for the update – this looks really good so far!

And I’m confused by this example, too – especially since Prodigy isn’t really doing anything magical here: it just uses the provided model to process and tokenize the text. So Prodigy’s tokenization should always match what you see when you use the model with spaCy directly.

Just as a sanity check: did you verify that this exact string is tokenized as expected? Maybe there’s an edge case that your custom tokenizer is missing.
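
For example, loading the saved model and running the exact text of the task through it – assuming the string from your screenshot was "1mV ac":

import spacy

nlp = spacy.load('./models/en_custom')
# Paste the task text verbatim – invisible differences (e.g. unicode
# variants or extra whitespace) can change the tokenization
print([t.text for t in nlp(u'1mV ac')])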