Hello, I am new to spaCy and I am struggling to train the POS tagger.
I am trying to train the POS tagger after customizing the tokenizer.
For example, the text "Il est culotté celui-là." is now tokenized as
['Il', 'est', 'culotté', 'celui-là', '.']
rather than the original
['Il', 'est', 'culotté', 'celui', '-', 'là', '.']
My problem is that nlp.update() doesn't seem to use my customized tokenizer: I can't annotate 'celui-là' as one token, only as three:
TRAIN_DATA = [
    ('celui-là', {'tags': ['PRON', 'PUNCT', 'PRON']})
]
However, the output shows that the customized tokenizer is applied, so my conclusion is that the tagger is being trained before the custom tokenizer is applied.
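To narrow down where the mismatch comes from, it may help to compare the number of tags in each training example with the number of tokens a given tokenizer produces for the same text. The sketch below is only an illustration in plain Python: split_on_hyphen and keep_hyphen are hypothetical regex stand-ins for the default and customized behaviour (they are not spaCy's tokenizer), and misaligned is a hypothetical helper name.

```python
import re
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, Dict[str, List[str]]]


def split_on_hyphen(text: str) -> List[str]:
    """Toy stand-in for the default behaviour: a hyphen becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def keep_hyphen(text: str) -> List[str]:
    """Toy stand-in for the customized behaviour: intra-word hyphens are kept."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def misaligned(train_data: List[Example], tokenize: Callable[[str], List[str]]):
    """Return (text, tokens, tags) for each example whose tag count differs from its token count."""
    problems = []
    for text, annotations in train_data:
        tokens = tokenize(text)
        tags = annotations["tags"]
        if len(tokens) != len(tags):
            problems.append((text, tokens, tags))
    return problems


# The three-tag annotation from the question, and a one-tag variant (assumed desired form).
TRAIN_DATA = [("celui-là", {"tags": ["PRON", "PUNCT", "PRON"]})]
ONE_TOKEN_DATA = [("celui-là", {"tags": ["PRON"]})]

print(misaligned(TRAIN_DATA, split_on_hyphen))  # three tags match three tokens
print(misaligned(TRAIN_DATA, keep_hyphen))      # three tags against a single token
print(misaligned(ONE_TOKEN_DATA, keep_hyphen))  # one tag matches one token
```

With a spaCy pipeline, the tokenize argument could presumably be something like lambda text: [t.text for t in nlp.make_doc(text)], since nlp.make_doc runs only the tokenizer; that wiring is an assumption and is not exercised in the sketch above.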
Here are the code and output:
Do you know how to apply my tokenizer modifications before training the tagger, so that I can train it with the right tokens?