Help with tokenizing numbers with units of measure

drem · July 30, 2018, 9:28pm

I have measurement values that I’ll be “learning” from. I’m labelling Number of units and unit of measure.

5.1V dc and 5.1 V dc should be labelled as:

Number of units = 5.1
Unit of measure = V dc

0.1mV ac and 0.1 mV ac should be labelled as:

Number of units = 0.1
Unit of measure = mV ac

As you can see, the Number of units may or may not have whitespace between itself and the Units of measure.

Could you please point me in the right direction?

Would I need a custom tokenizer to split them up? If so, any thoughts on how best to do that?

ines · July 31, 2018, 4:57pm

Hi! I just ran one of your examples through the default English tokenizer and you’re right, "5.1V dc" is currently split as ['5.1V', 'dc']. So it probably makes sense to customise the tokenizer and build your own set of rules specific to your domain, starting with suffix rules that split off alphanumeric characters following numbers. This comment shows how you can extend the default expressions with your own, and save out a custom model.
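To make the suffix idea concrete, here is a minimal sketch using only Python's `re` module. spaCy's suffix rules are regular expressions matched against the end of a token, so a pattern like the one below could serve as a starting point. The pattern itself and the helper name `split_unit_suffix` are assumptions for illustration, not something taken from spaCy or from this thread.

```python
import re

# Hypothetical suffix-style rule: a run of letters at the end of a token,
# matched only when it directly follows a digit.
unit_suffix_re = re.compile(r"(?<=[0-9])[A-Za-zµΩ]+$")

def split_unit_suffix(token):
    """Split a trailing unit off a number-like token, if there is one."""
    match = unit_suffix_re.search(token)
    if match is None:
        return [token]
    return [token[:match.start()], match.group()]

print(split_unit_suffix("5.1V"))   # ['5.1', 'V']
print(split_unit_suffix("0.1mV"))  # ['0.1', 'mV']
print(split_unit_suffix("dc"))     # ['dc']
```

A rule this broad would also split tokens such as "3rd", so it probably only makes sense for domain-specific text like measurement values.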

The tokenizer rules are serialized with the model, and Prodigy supports loading in spaCy models from packages, links and directories – so you can just pass in the path of your model when you run a recipe:

prodigy ner.teach your_dataset /path/to/custom-model ...

Btw, you might also want to experiment with different combinations of rule-based and statistical approaches. For example, there are only so many units, right? If so, you could use the matcher to find them in your text, and then check whether the previous token has a number-like value. This thread and this thread both have some more details and examples of this approach.
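The rule-based idea can be sketched in plain Python, assuming the text has already been tokenized so that numbers and units are separate tokens. In spaCy you would typically use the `Matcher` and the `token.like_num` attribute instead; the unit list, the qualifier list, and the function name below are all hypothetical stand-ins.

```python
import re

# Hypothetical, deliberately short lists; a real project would need its own.
KNOWN_UNITS = {"V", "mV", "kV", "A", "mA"}
QUALIFIERS = {"ac", "dc"}

# Simple stand-in for a "number-like" check: optional sign, digits,
# optional decimal part.
NUMBER_RE = re.compile(r"[±+\-]?[0-9]+(?:[.,][0-9]+)?")

def find_measurements(tokens):
    """Return (number, unit) pairs where a known unit follows a number."""
    found = []
    for i in range(1, len(tokens)):
        if tokens[i] in KNOWN_UNITS and NUMBER_RE.fullmatch(tokens[i - 1]):
            unit = tokens[i]
            # Attach a trailing qualifier such as "ac" or "dc" to the unit.
            if i + 1 < len(tokens) and tokens[i + 1] in QUALIFIERS:
                unit += " " + tokens[i + 1]
            found.append((tokens[i - 1], unit))
    return found

print(find_measurements(["5.1", "V", "dc"]))   # [('5.1', 'V dc')]
print(find_measurements(["0.1", "mV", "ac"]))  # [('0.1', 'mV ac')]
```

This only works once the tokenizer separates the number from the unit, so it complements the custom tokenization rather than replacing it.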

Alternatively, spaCy’s pre-trained models already have a category QUANTITY, which you could try and fine-tune on your data. This might be quicker and more efficient, because you don’t need to teach the model all of this from scratch. The solution you choose ultimately depends on your data, and it’s difficult to predict which one works best. You might want to try out a few approaches and test them on the same evaluation set to find the one that produces the best results. (Prodigy can also help with that, since it lets you do quick binary evaluations.)

drem · August 6, 2018, 2:44am

Hi ines,

Thanks for the reply. I was able to get the tokenization working with your guidance. I saved it as a custom model and then used Prodigy to start some labeling. However, Prodigy wasn’t using the tokenizer that I created. I’m sure I’m missing something obvious.

Here’s what I did:

The code:

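import re

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load('en')

# Prefix rule: a leading run of sign, digit, separator or space characters
# that ends in one or more digits or spaces
prefix_re = re.compile(r'''^[±\-\+0-9., ]+[0-9 ]+''')
# Infix rule: split on forward slashes, e.g. mV/km
infix_re = re.compile(r'''[/]''')

def my_tokenizer_pri(nlp):
  # The empty dict means no special-case rules; the suffix rules and
  # token_match are reused from the default English tokenizer
  return Tokenizer(nlp.vocab,
    {},
    prefix_search=prefix_re.search,
    suffix_search=nlp.tokenizer.suffix_search,
    infix_finditer=infix_re.finditer,
    token_match=nlp.tokenizer.token_match)

nlp.tokenizer = my_tokenizer_pri(nlp)

# Save the pipeline, including the custom tokenizer rules
nlp.to_disk('./models/en_custom')

print([w.text for w in nlp(u'0.1mV/km ac')])
print([w.text for w in nlp(u'3 → 12V')])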

Output:

['0.1', 'mV', '/', 'km', 'ac']
['3', '→', '12', 'V']

I can also confirm the saved model does load and function as expected by doing this:

nlp = spacy.load('./models/en_custom')
print([w.text for w in nlp(u'0.1mV/km ac')])
print([w.text for w in nlp(u'3 → 12V')])

Output:

['0.1', 'mV', '/', 'km', 'ac']
['3', '→', '12', 'V']

Then I ran the following:

prodigy ner.manual volts ./models/en_custom volts.jsonl --label QUANTITY,UOM

But… I was presented with this:

My tokenizer (from within the code above) would have split this as: 1, mV, ac.

What am I missing?

Thanks

ines · August 6, 2018, 8:57am

Thanks for the update – this looks really good so far!

And I’m confused about this example, too… Especially since Prodigy isn’t really doing anything magical here – it just uses the provided model to process and tokenize the text. So Prodigy’s tokenization should always match what you’re seeing if you’re using the model with spaCy directly.

Just as a sanity check, did you check that this exact string is tokenized as expected? Maybe there’s some edge case that your custom tokenizer is missing?
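One possible edge case can be checked without spaCy at all, by running the prefix pattern from the code above against individual tokens. The sketch below assumes the text in question contained a single-digit value such as "1mV" (the screenshot is not available here, so that is a guess). The helper `split_prefix` and the alternative pattern `relaxed_prefix_re` are hypothetical and only meant for illustration.

```python
import re

# Prefix pattern copied from the tokenizer code earlier in the thread.
prefix_re = re.compile(r'''^[±\-\+0-9., ]+[0-9 ]+''')

# Hypothetical alternative: optional sign, digits, optional decimal groups.
# Unlike the original, it does not allow spaces inside the number.
relaxed_prefix_re = re.compile(r'''^[±+\-]?[0-9]+(?:[.,][0-9]+)*''')

def split_prefix(token, pattern=prefix_re):
    """Return (prefix, rest) if the pattern matches, else (None, token)."""
    match = pattern.search(token)
    if match is None:
        return None, token
    return match.group(), token[match.end():]

for token in ["0.1mV", "12V", "1mV"]:
    print(token, split_prefix(token), split_prefix(token, relaxed_prefix_re))
```

The original pattern chains two character classes that each need at least one character, so it only matches a numeric prefix of two or more characters. A token like "1mV" therefore gets no prefix match and would stay unsplit, while "0.1mV" and "12V" are split as expected.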
