I have a follow-up question to this topic. I used data-to-spacy with the following parameters:
```
prodigy data-to-spacy output --ner my_dataset --ner-missing --base-model output/my_model/model-last -F functions.py
```
I passed --base-model because my data was collected using a custom tokenizer, and -F to point Prodigy to the file that defines that tokenizer, although I didn't see the -F parameter documented anywhere for this command.
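For context, functions.py registers the tokenizer roughly along these lines. This is a simplified sketch: the real tokenization logic is more involved, and both the whitespace splitting and the registry name "custom_tokenizer" here are just illustrative.

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Toy stand-in for my real tokenizer: splits on whitespace."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        return Doc(self.vocab, words=words)

@spacy.registry.tokenizers("custom_tokenizer")
def create_custom_tokenizer():
    def create_tokenizer(nlp):
        return WhitespaceTokenizer(nlp.vocab)
    return create_tokenizer
```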
I got 727 training examples and 285 evaluation examples from this.
Then, when I train a spaCy model on this data, I get poor results: a 0.3 F1 score and rising TOK2VEC and NER losses, indicating something is very wrong.
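For completeness, I train roughly like this, passing the same code file so the registered tokenizer can be resolved (the paths are from my setup, assuming data-to-spacy wrote config.cfg, train.spacy and dev.spacy into output/):

```
python -m spacy train output/config.cfg --paths.train output/train.spacy --paths.dev output/dev.spacy --code functions.py
```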
I was previously able to train a spaCy model on similar data collected using the ner_manual Prodigy interface. We collected about 800 examples and got 0.7 F1, with diminishing TOK2VEC and NER losses.
Both models use very similar config files, with pretrained word vectors and a tok2vec layer pretrained on a large dataset.
I suspect that data-to-spacy is somehow not picking up my custom tokenizer, because when I run it without --base-model and -F (see below), I get similar performance (0.3 F1 with increasing losses):

```
prodigy data-to-spacy output --ner my_dataset --ner-missing
```
I'm not sure how to debug further. How can I verify whether my tokenizer is being used, and which parameter should I use to pass the custom tokenizer?
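The closest I've gotten to a check myself is the sketch below (assuming the registry name "custom_tokenizer" from above and the files data-to-spacy wrote into output/): first look at the tokenizer entry in the generated config, then compare the tokens stored in the exported .spacy file against what the loaded pipeline's tokenizer actually produces. Is something like this a reasonable way to verify it?

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import load_config

import functions  # noqa: F401  (importing this registers "custom_tokenizer")

# 1) Does the generated config reference the custom tokenizer at all?
cfg = load_config("output/config.cfg")
print(cfg["nlp"]["tokenizer"])  # hoping for {"@tokenizers": "custom_tokenizer"}

# 2) Does the loaded pipeline actually use the custom tokenizer class?
nlp = spacy.load("output/my_model/model-last")
print(type(nlp.tokenizer))  # expecting my class, not spacy.tokenizer.Tokenizer

# 3) Do the exported Docs match what the tokenizer would produce?
doc_bin = DocBin().from_disk("output/train.spacy")
for doc in list(doc_bin.get_docs(nlp.vocab))[:10]:
    stored = [t.text for t in doc]
    retokenized = [t.text for t in nlp.make_doc(doc.text)]
    if stored != retokenized:
        print("MISMATCH:", doc.text[:60])
        print("  stored:     ", stored)
        print("  retokenized:", retokenized)
```

If the stored and re-tokenized words disagree, I'd expect that to explain the misaligned NER spans and the poor scores, as far as I understand it.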