I still have a few questions though, @magdaaniol:
1. Side-effects of more tokens
During labeling I see examples that would need even finer tokens, especially for numbers.
Does producing even more fine-grained tokens have any side effects, especially for numbers? I imagine that when I break a number like "12345"
into ["1", "2", "3", "4", "5"],
this somehow loses information? (See the small sketch below for what I mean.)
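Just to make it concrete, here is roughly the kind of split I mean. This only contrasts the default tokenization with the per-digit split I'm producing; my actual custom rules are not shown here:

```python
import spacy

# Default English tokenizer keeps the number as one token
nlp = spacy.blank("en")
print([t.text for t in nlp("12345")])  # ["12345"]

# What my custom tokenizer produces instead: one token per digit
per_digit = list("12345")
print(per_digit)                       # ["1", "2", "3", "4", "5"]
```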
2. tok2vec shows loss
In this post you said that the tok2vec layer is not trained. I didn't fully understand why, but my training output looks like this:
So I do see a loss in the tok2vec column. Is this expected?
3. Using custom tokenizer in ner.correct
After I trained and saved the model with the custom tokenizer as above, should I be able to use it in an ner.correct
recipe without problems, like this?
prodigy ner.correct dataset_net ./models_custom_tok/model-best/ corpus/my_examples.jsonl --label LABEL1,LABEL2,LABEL3
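Before running that, this is roughly how I'd sanity-check on my side that the custom tokenizer was actually saved with model-best (just a hedged check, assuming a standard spaCy load of the path above):

```python
import spacy

# Load the trained pipeline that should contain the custom tokenizer
nlp = spacy.load("./models_custom_tok/model-best")

# Should report my custom tokenizer class, not the default Tokenizer
print(type(nlp.tokenizer))

# Should show the fine-grained, per-digit split from my custom rules
print([t.text for t in nlp("12345")])
```

Is that check sufficient, or does ner.correct need anything else to pick up the tokenizer from the model?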