Spancat is not training

Hi Ines,

I am trying to train a model for spans (I have a single label), but when I train it, all the performance scores are zero; in other words, the model learned nothing. I also tried your en_core_web_sm solution and it did not work.

Here is my config file:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH","SHAPE"]
rows = [5000,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0
ner = null
textcat = null
textcat_multilabel = null
parser = null
tagger = null
senter = null

[corpora.spancat]
@readers = "prodigy.SpanCatCorpus.v1"
datasets = ["ops_spans_not_custom"]
eval_datasets = []
spans_key = "sc"

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
before_init = {"@callbacks":"customize_tokenizer"}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Here is what my spans look like (they all contain underscores):

car_number_123_irf

and here is my training recipe:

!prodigy train ./model3 --spancat dataset_manual -c config.cfg -F functions.py --verbose

I am using a custom tokenizer, and this is my functions.py:

import re

import spacy
from spacy.tokenizer import Tokenizer


# register the callback so the config's [initialize] before_init can find it
@spacy.registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        special_cases = {"A.M.": [{"ORTH": "A.M."}],
                         "P.M.": [{"ORTH": "P.M."}],
                         "U.S.": [{"ORTH": "U.S."}]}
        prefix_re = re.compile(r'''''')
        # remove a suffix
        suffix_re = re.compile(r'''([)."']|('s))$''')
        infix_re = re.compile(r'''[-~:_/\.,]''')
        nlp.tokenizer = Tokenizer(nlp.vocab, rules=special_cases,
                                  prefix_search=prefix_re.search,
                                  suffix_search=suffix_re.search,
                                  infix_finditer=infix_re.finditer,
                                  token_match=nlp.tokenizer.token_match,
                                  url_match=nlp.Defaults.url_match)
    return customize_tokenizer
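For what it's worth, because the infix pattern includes `_`, this tokenizer will break each of my spans into many tokens. A minimal sketch of the effect (plain `re`, not spaCy, so it only approximates the boundaries):

```python
import re

# The infix pattern from customize_tokenizer above; note that it includes "_",
# so underscores become token boundaries.
infix_re = re.compile(r'''[-~:_/\.,]''')

def split_on_infixes(text):
    """Rough sketch of how infix matches break a string into pieces.
    spaCy's tokenizer is more involved, but the boundaries are the same idea."""
    pieces, start = [], 0
    for m in infix_re.finditer(text):
        if m.start() > start:
            pieces.append(text[start:m.start()])
        pieces.append(m.group())  # the infix itself becomes its own token
        start = m.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces

print(split_on_infixes("car_number_123_irf"))
# → ['car', '_', 'number', '_', '123', '_', 'irf']
```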

I have also tried the standard tokenizer, but the training step skips all the tagged spans and the training performance is zero.
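One possible explanation, if it helps: with the standard tokenizer, `car_number_123_irf` stays a single token, so an annotation covering only part of it cannot land on token boundaries, and spaCy drops misaligned spans during training. A rough illustration of the boundary check (the offsets here are hypothetical example values, and this is not spaCy's actual implementation):

```python
def span_is_aligned(token_offsets, span_start, span_end):
    """A character-offset span is only usable for training when both of its
    ends fall exactly on token boundaries; otherwise it gets skipped."""
    starts = {start for start, _ in token_offsets}
    ends = {end for _, end in token_offsets}
    return span_start in starts and span_end in ends

# Suppose the standard tokenizer keeps "car_number_123_irf" as one token
# in the text "the car_number_123_irf arrived":
tokens = [(0, 3), (4, 22), (23, 30)]   # "the", "car_number_123_irf", "arrived"
print(span_is_aligned(tokens, 4, 22))  # whole token → True
print(span_is_aligned(tokens, 4, 14))  # "car_number" only → False
```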

Hi Aida.

Ines sometimes replies to questions on the support forum, but there's a team of folks who answer messages here, and we can't guarantee who replies since it depends on availability.

One small preference: could you make sure that your code is surrounded by backticks (```) in the future? That way it renders as formatted code blocks, which makes it easier to read the code and to help.

> however when I train the model all the performance scores are zero

Could you share the output of the prodigy train command? When I run a spancat train command locally I see something like this:

> python -m prodigy train --spancat namespandemo

ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 1 to 2 (inferred from data)
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-06-02 10:50:19,238] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: spancat (1)
[2022-06-02 10:50:19,250] [INFO] Pipeline: ['spancat']
[2022-06-02 10:50:19,253] [INFO] Created vocabulary
[2022-06-02 10:50:19,254] [INFO] Finished initializing nlp object
[2022-06-02 10:50:19,278] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: spancat (1)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0          4.68        0.00        0.00        0.00    0.00
200     200          6.61        0.00        0.00        0.00    0.00
400    4200          7.88      100.00      100.00      100.00    1.00

There are a bunch of zero scores at the beginning of this training run, but that can be normal. It might just be that you need to allow for more steps before the model starts scoring well. This might be what you're experiencing, but I'm not 100% sure.

Is there a reason you're using a custom config file?

You might also benefit from a larger n-gram range in your config. Also, notice the @ symbol in the config below.

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3,4,5,6,7,8,9,10]
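The reason the range matters: the suggester only ever proposes contiguous token windows of exactly these sizes, so if your tokenizer splits `car_number_123_irf` into seven tokens, a maximum size of 3 can never produce a candidate covering the whole span. A quick sketch of the enumeration (plain Python mirroring the idea behind `spacy.ngram_suggester.v1`, not its actual implementation):

```python
def ngram_spans(tokens, sizes):
    """Enumerate every contiguous token window whose length is in `sizes`,
    as (start, end) token offsets — the candidate spans a model can score."""
    spans = []
    for size in sizes:
        for start in range(len(tokens) - size + 1):
            spans.append((start, start + size))
    return spans

tokens = ["car", "_", "number", "_", "123", "_", "irf"]
candidates = ngram_spans(tokens, sizes=[1, 2, 3])
print(len(candidates))        # 7 + 6 + 5 = 18 candidates
print((0, 7) in candidates)   # the full 7-token span is never suggested → False
```

With `sizes` extended up to 10, the full-span candidate `(0, 7)` would be included.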


@koaning Thanks for your prompt response! How can I change the number of n-grams in the config file without getting them reset when I run the train recipe again?

Please refrain from using screenshots to share code or text output. It makes it impossible to copy/paste and it also won't be searchable for other users.

> How can I change the number of n-grams in the config file without getting them reset when I run the train recipe again?

Could you clarify what you mean by "without getting them reset"? Did you replace the configuration and run again?

Could you also clarify why you set spans_key = "sc"? Is there a general reason why you opted for a custom configuration file?