Based on your screenshot, it looks like you only have 10 examples in total. Is this intentional? There's usually not much to learn from 8 training examples, and not much to evaluate with 2 (if you get both wrong, the accuracy is 0). In that sense, the numbers we see during training make sense given the number of samples you have.
My suggestion is to try it out on a relatively larger sample of data. You can go little by little, starting in the order of hundreds and working up from there.
According to the logs, you still only have 24 evaluation examples, so it's difficult to draw meaningful conclusions from such a small evaluation set.
If your data requires your custom tokenizer and the model may not predict accurately without it, that could be something to look into. If the base model is only intended to provide the tokenizer and has no trained components to update, can you add the tokenizer via the `--config` instead and leave out `--base-model`?
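For example (a rough sketch, not your exact setup, and `"my_custom_tokenizer"` is just a placeholder name), you could register the tokenizer with spaCy's `tokenizers` registry so the config can reference it by name:

```python
# Sketch: register a custom tokenizer so the training config can reference it
# by name, without needing a --base-model just to carry the tokenizer.
import spacy
from spacy.tokenizer import Tokenizer

@spacy.registry.tokenizers("my_custom_tokenizer")  # placeholder registry name
def create_my_custom_tokenizer():
    def create_tokenizer(nlp):
        # Build and return your tokenizer here; this plain Tokenizer is just
        # a stand-in for your custom logic.
        return Tokenizer(nlp.vocab)
    return create_tokenizer
```

Then in the config you'd point `[nlp.tokenizer]` at it with `@tokenizers = "my_custom_tokenizer"`, and make sure the file with the registration is importable when training runs (e.g. via the `--code` argument if you're using `spacy train`).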