Custom model, NER train, score always 0

I created a custom model like this:

import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_md")

# split prevNameTail, prev_name_tail, obj.child.key
infixes = (r"[A-Z][a-z0-9]+", r"[a-z0-9A-Z]+") + tuple(nlp.Defaults.prefixes) + (r"[^\w\s]+",)
infix_re = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
nlp.to_disk('my_en_core_web_md')
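
To see how this tokenizer splits the examples from the comment, a quick check (just printing the tokens, using the nlp object from the snippet above):

# sanity check: how does the custom tokenizer split these strings?
for text in ["prevNameTail", "prev_name_tail", "obj.child.key"]:
    print(text, "->", [t.text for t in nlp(text)])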

And the train command:

python -m prodigy train my-model --ner gold_ner --base-model my_en_core_web_md --label-stats --eval-split 0.2

I've tried many times, and the score is always 0:

Hi @linb ,

Based on your screenshot, it seems that you only have 10 examples in total. Is this intentional? Usually there's not much to learn from 8 examples, and not much to evaluate with 2 (i.e., if you get both wrong, the accuracy is 0). In a way, the numbers we see during training make sense given the number of samples we have.

My suggestion is to try it out on a relatively large sample of data. You can scale up little by little, starting in the order of hundreds.

Thanks for your advice. I added 99 examples. Same result:

python -m prodigy train scripts-model-1 --ner script_gold_ner-1 --base-model Arganteal_en_core_web_md --label-stats --eval-split 0.25

I think the reason may be that when using a customized --base-model, Prodigy ignores the new tokenizer.

According to the logs, you still only have 24 evaluation examples, so it's difficult to draw meaningful conclusions here since the evaluation set is so small.

If your data requires your custom tokenizer and the model may not predict accurately when it's not available, this could be something to look into. If your base model is only intended to provide the tokenizer, with no trained components to update, can you add the tokenizer via the --config instead and leave out the --base-model?

If I use --config, how can I change the tokenizer setting to use my custom tokenizer?

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1" ???

Let's say the tokenizer file is my_tokenizer.py:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

@spacy.registry.tokenizers("my_tokenizer")
def create_my_tokenizer():
    # placeholder pattern; replace with the real infix patterns
    infixes = (r".",)
    infix_re = compile_infix_regex(infixes)

    def create_tokenizer(nlp):
        return Tokenizer(
            nlp.vocab,
            prefix_search=None,
            suffix_search=None,
            infix_finditer=infix_re.finditer,
            token_match=None,
        )

    return create_tokenizer

Yes, that's pretty much it :slightly_smiling_face: You'd then set the tokenizer to your custom tokenizer:

[nlp.tokenizer]
@tokenizers = "my_tokenizer"

... and use the -F option of prodigy train to point at the file containing the code, e.g. -F my_tokenizer.py.
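
For example, assuming your config file is called config.cfg (just a placeholder name here), you could generate a starter config with spacy init config and swap in the [nlp.tokenizer] block above:

python -m spacy init config config.cfg --lang en --pipeline ner
# edit config.cfg: replace the [nlp.tokenizer] block with the custom one above
python -m prodigy train my-model --ner gold_ner --config config.cfg -F my_tokenizer.py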