Hi Ines,
Thanks for the reply. I was able to get the tokenization working with your guidance. I saved it as a custom model and then used Prodigy to start some labeling. However, Prodigy wasn’t using the tokenizer that I created. I’m sure I’m missing something obvious.
Here’s what I did:
The code:
import spacy
from spacy.lang.en import English
import re
from spacy.tokenizer import Tokenizer
nlp = spacy.load('en')
prefix_re = re.compile(r'''^[±\-\+0-9., ]+[0-9 ]+''')
infix_re = re.compile(r'''[/]''')
def my_tokenizer_pri(nlp):
    return Tokenizer(nlp.vocab,
                     {},
                     prefix_search=prefix_re.search,
                     suffix_search=nlp.tokenizer.suffix_search,
                     infix_finditer=infix_re.finditer,
                     token_match=nlp.tokenizer.token_match)
nlp.tokenizer = my_tokenizer_pri(nlp)
nlp.to_disk('./models/en_custom')
print([w.text for w in nlp(u'0.1mV/km ac')])
print([w.text for w in nlp(u'3 → 12V')])
Output:
['0.1', 'mV', '/', 'km', 'ac']
['3', '→', '12', 'V']
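As a side check, the two patterns can also be exercised on their own with the standard `re` module, outside spaCy. This is only a rough sketch: `show_matches` is a hypothetical helper (not part of spaCy or Prodigy), and it only shows what the regexes match on a single whitespace-delimited chunk, not how the full tokenizer behaves.

```python
import re

# Same patterns as in the code above.
prefix_re = re.compile(r'''^[±\-\+0-9., ]+[0-9 ]+''')
infix_re = re.compile(r'''[/]''')


def show_matches(chunk):
    """Hypothetical helper: return the prefix match text (or None)
    and the list of infix match texts for one chunk."""
    prefix = prefix_re.search(chunk)
    infixes = [m.group() for m in infix_re.finditer(chunk)]
    return (prefix.group() if prefix else None, infixes)


for chunk in ['0.1mV/km', '12V']:
    print(chunk, show_matches(chunk))
```

For these two chunks the prefix pattern picks out '0.1' and '12', which is consistent with the token splits shown above.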
I can also confirm the saved model does load and function as expected by doing this:
nlp = spacy.load('./models/en_custom')
print([w.text for w in nlp(u'0.1mV/km ac')])
print([w.text for w in nlp(u'3 → 12V')])
Output:
['0.1', 'mV', '/', 'km', 'ac']
['3', '→', '12', 'V']
Then I ran the following:
prodigy ner.manual volts ./models/en_custom volts.jsonl --label QUANTITY,UOM
But… I was presented with this:
My tokenizer (from within the code above) would have split this as: 1, mV, ac.
What am I missing?
Thanks