Spancat is not trained

aida.sharif · June 2, 2022, 3:19am

Hi Ines,

I am trying to train a model for spans (I have a single label), however when I train the model all the performance scores are zero, in other words the model learned nothing. I also tried your en_core_web_sm solution and it did not work.

Here is my config file:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","spancat"]
batch_size = 1000
disabled =
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"at tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.spancat.suggester]
misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH","SHAPE"]
rows = [5000,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]
readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0
ner = null
textcat = null
textcat_multilabel = null
parser = null
tagger = null
senter = null

[corpora.spancat]
readers = "prodigy.SpanCatCorpus.v1"
datasets = ["ops_spans_not_custom"]
eval_datasets =
spans_key = "sc"

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components =
annotating_components =
before_to_disk = null

[training.batcher]
batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
before_init = {"callbacks":"customize_tokenizer"}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.components]

[initialize.tokenizer]

here is how my spans look like (they all have underlines):

car_number_123_irf

and here is my training recipe:

!prodigy train ./model3 --spancat dataset_manual -c config.cfg -F functions.py --verbose

I am using a custom tokenizer, and this is my functions.py:

def make_customize_tokenizer():
def customize_tokenizer(nlp):
special_cases = {"A.M.": [{"ORTH": "A.M."}],
"P.M.": [{"ORTH": "P.M."}],
"U.S.": [{"ORTH": "U.S."}]}
prefix_re = re.compile(r'''''')
suffix_re = re.compile(r'''()."']|('s))$''')
infix_re = re.compile(r'''[-~:_/\.,]''')
# remove a suffix
nlp.tokenizer = Tokenizer(nlp.vocab, rules=special_cases,
prefix_search=prefix_re.search,
suffix_search=suffix_re.search,
infix_finditer=infix_re.finditer,
token_match=nlp.tokenizer.token_match,
url_match=nlp.Defaults.url_match)
return customize_tokenizer

I have also tried the standard tokenizers but the training step skips all tagged spans and the training performance is zero.

koaning · June 2, 2022, 8:53am

Hi Aida.

Ines sometimes replies to questions on the support forum, but there's a team of folks who reply to the messages here and we cannot guarantee who replies because this depends on availability.

One small preference, could you make sure that your code is surrounded by ticks (```) in the future, that way pretty code blocks render, which makes it easier to read the code and to help.

however when I train the model all the performance scores are zero

Could you share the output of the prodigy train command? When I run a spancat train command locally I see something like this:

> python -m prodigy train --spancat namespandemo

ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 1 to 2 (inferred from data)
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-06-02 10:50:19,238] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: spancat (1)
[2022-06-02 10:50:19,250] [INFO] Pipeline: ['spancat']
[2022-06-02 10:50:19,253] [INFO] Created vocabulary
[2022-06-02 10:50:19,254] [INFO] Finished initializing nlp object
[2022-06-02 10:50:19,278] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 4 | Evaluation: 1 (20% split)
Training: 4 | Evaluation: 1
Labels: spancat (1)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0          4.68        0.00        0.00        0.00    0.00
200     200          6.61        0.00        0.00        0.00    0.00
400    4200          7.88      100.00      100.00      100.00    1.00

There are a bunch of zero scores in this output at the beginning of the training run, but that can be normal. It might just be that you need to allow for more steps before it starts scoring well. This might be what you're experiencing, but I'm not 100% sure.

Is there a reason you're using a custom config file?

koaning · June 2, 2022, 9:44am

You might also benefit from having a larger n_gram_range in your config. Also, notice the @ symbol in the config below.

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3,4,5,6,7,8,9,10]

aida.sharif · June 2, 2022, 3:07pm

aida.sharif · June 2, 2022, 3:09pm

@koaning Thank for your prompt response! How can I change the number of ngrams in config file without get them reset when I run the train recipe again?

koaning · June 2, 2022, 8:14pm

Please refrain from using screenshots to share code or text output. It makes it impossible to copy/paste and it also won't be searchable for other users.

How can I change the number of ngrams in config file without get them reset when I run the train recipe again?

Could you clarify what you mean with "without get them reset"? Did you replace the configuration and run again?

Could you also clarify why you set spans_key = "sc"? Is there a general reason why you opted for a custom configuration file?

aida.sharif · July 11, 2022, 8:22pm

Hi Koaning,

Do you have any guidelines on how to format our questions? I did not find anything and did not know screenshots are not allowed

Here is my recipe:

!prodigy train ./span_model --spancat ops_spans -c config.cfg -F functions.py --eval-split 0.2 --label-stats

This is the config file I use (I did not add spans_key = "sc", it was generated the first time I ran the recipe without the custom config file and I just added the misc and ngram size array that you menthioned :

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2000,1000,1000]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4

[components.spancat.suggester]
#@misc = "spacy.ngram_range_suggester.v1"
#min_size = 1
#max_size = 3

@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3,4,5,6,7,8,9,10]

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0
ner = null
textcat = null
textcat_multilabel = null
parser = null
tagger = null
senter = null

[corpora.spancat]
@readers = "prodigy.SpanCatCorpus.v1"
datasets = ["ops_spans"]
eval_datasets = []
spans_key = "sc"

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

and I get these results:

ℹ Using CPU

========================= Generating Prodigy config =========================
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-07-11 13:11:04,230] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 420 | Evaluation: 105 (20% split)
Training: 420 | Evaluation: 105
Labels: spancat (1)
[2022-07-11 13:11:04,275] [INFO] Pipeline: ['spancat']
[2022-07-11 13:11:04,279] [INFO] Created vocabulary
[2022-07-11 13:11:04,279] [INFO] Finished initializing nlp object
[2022-07-11 13:11:04,624] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 420 | Evaluation: 105 (20% split)
Training: 420 | Evaluation: 105
Labels: spancat (1)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0         41.75        0.00        0.00        0.00    0.00
  1     200        350.89        0.00        0.00        0.00    0.00
  2     400          1.87        0.00        0.00        0.00    0.00
  4     600          2.50        0.00        0.00        0.00    0.00
  6     800          0.01        0.00        0.00        0.00    0.00
  9    1000          0.00        0.00        0.00        0.00    0.00
 12    1200          0.00        0.00        0.00        0.00    0.00
 16    1400          0.00        0.00        0.00        0.00    0.00
 21    1600          0.00        0.00        0.00        0.00    0.00
✔ Saved pipeline to output directory
/Users/asharifr/NLP_textSummarization/span_model/model-last

============================== SPANS (per type) ==============================

              P      R      F
en_route   0.00   0.00   0.00

Please let me know if you need any more info.

P.S. Here is my functions.py:

from spacy.util import registry, compile_suffix_regex
import re
from spacy.tokenizer import Tokenizer

@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        special_cases = {"A.M.": [{"ORTH": "A.M."}], 
                    "P.M.": [{"ORTH": "P.M."}],
                    "U.S.": [{"ORTH": "U.S."}]}
        prefix_re = re.compile(r'''^[-\[\("']''')
        suffix_re = re.compile(r'''([\]\)\."']|('s))$''')
        infix_re = re.compile(r'''[-~:_/\\\.,]''')
        # remove a suffix
        nlp.tokenizer = Tokenizer(nlp.vocab, rules=special_cases,
                                prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                url_match=nlp.Defaults.url_match)
    return customize_tokenizer

koaning · July 12, 2022, 11:57am

Interesting. Looking at the output it indeed seems like it's not detecting much while the loss is still going down.

Is it possible for you to share a small set of examples such that I may try to reproduce the error locally?

aida.sharif · July 12, 2022, 5:04pm

I think I found the issue, your data seems to be missing annotations. If I look at your first example in JSON format it looks like this:

{
    "text": "EVENT TIME: 01/0100 - AND LATER", 
    "_input_hash": 1430753446, 
    "_task_hash": 503749019, 
    "_view_id": "spans_manual", 
    "answer": "accept", 
    "_timestamp": 1654136034
}

I fed this example to Prodigy manually and annotated something in it.

python -m prodigy spans.manual issue-5683 blank:en examples.jsonl --label demo

When I annotate, save and run the db-out command afterwards...

 python -m prodigy db-out issue-5683

... then this is what the annotation looks like:

{
    "text":"EVENT TIME: 01/0100 - AND LATER",
    "_input_hash":1430753446,
    "_task_hash":503749019,
    "tokens":[
        {"text":"EVENT","start":0,"end":5,"id":0,"ws":true},
        {"text":"TIME","start":6,"end":10,"id":1,"ws":false},
        {"text":":","start":10,"end":11,"id":2,"ws":true},
        {"text":"01/0100","start":12,"end":19,"id":3,"ws":true},
        {"text":"-","start":20,"end":21,"id":4,"ws":true},
        {"text":"AND","start":22,"end":25,"id":5,"ws":true},
        {"text":"LATER","start":26,"end":31,"id":6,"ws":false}
    ],
    "_view_id":"spans_manual",
    "spans":[
        {"start":0,"end":19,"token_start":0,"token_end":3,"label":"demo"}
    ],
    "answer":"accept",
    "_timestamp":1657698750
}

Notice the spans key there? That's the annotation that's needed to update the span categorization model. I just checked your data and it seems like non of your examples have this key, which suggests something went awry during annotation.

Can you try annotating one example to confirm that the spans key does appear on your end?

aida.sharif · July 13, 2022, 10:56pm

I do have the spans key in my json file, there might not be so many of them but if you look for the keyword "'spans':" you will see them; here is an example:

'spans': [{'start': 35, 'end': 45, 'label': 'en_route'}

koaning · July 14, 2022, 11:22am

Could you send me an entire example that has a span key? Even when there is no data annotated then the annotated example should still contain a "spans"-key.

{
  "text":"EVENT TIME: 01/0100 - AND LATER AND STUFF",
  "_input_hash":-2139263339,
  "_task_hash":-148943722,
  "tokens":[
    {"text":"EVENT","start":0,"end":5,"id":0,"ws":true},
    {"text":"TIME","start":6,"end":10,"id":1,"ws":false},
    {"text":":","start":10,"end":11,"id":2,"ws":true},
    {"text":"01/0100","start":12,"end":19,"id":3,"ws":true},
    {"text":"-","start":20,"end":21,"id":4,"ws":true},
    {"text":"AND","start":22,"end":25,"id":5,"ws":true},
    {"text":"LATER","start":26,"end":31,"id":6,"ws":true},
    {"text":"AND","start":32,"end":35,"id":7,"ws":true},
    {"text":"STUFF","start":36,"end":41,"id":8,"ws":false}
  ],
  "_view_id":"spans_manual",
  "spans":[],
  "answer":"accept",
  "_timestamp":1657797480
}

Can you confirm that when you annotate a single example that the "spans" key appears? It should appear in all your annotated examples. You seem to have removed your examples, but when I downloaded them they all had "spans" missing.

aida.sharif · July 15, 2022, 7:01pm

Here is the full example:

{'text': 'EN ROUTE ACTIVE:\n1100-2200       -UPSTATE_NY_AND_CANADA_VIA_J61_Q103 \n1100-2200       -OHIO_VALLEY_TO_FLORIDA \n1100-2200       -MIDWEST_TO_FLORIDA_PARTIAL \n1100-2200       -TEXAS_AND_ZME_TO_ZNY_ZBW \n1100-2200       -TEXAS_AND_ZME_TO_DC_METS ', '_input_hash': 1921527186, '_task_hash': -474446979, '_view_id': 'spans_manual', 'spans': [{'start': 34, 'end': 68, 'label': 'en_route'}, {'start': 87, 'end': 109, 'label': 'en_route'}, {'start': 128, 'end': 154, 'label': 'en_route'}, {'start': 173, 'end': 197, 'label': 'en_route'}, {'start': 216, 'end': 240, 'label': 'en_route'}], 'answer': 'accept', '_timestamp': 1654136099}

I don't have the spans key when there is no span tagged. I use spans.manual to annotate my examples, where should I see "spans" key appear?

I wanted to upload the whole dataset but it was large and your website did not allow me.

koaning · July 27, 2022, 12:05pm

Sorry for the delay!

I'm having a bit of trouble replicating this locally, so instead it might be good to try running an experiment on your end. I annotated some examples on my machine. It's an easy example about detecting the name "Vincent".

{"text":"Hi. I\u2019m Vincent.","_input_hash":-144985579,"_task_hash":-1382444776,"tokens":[{"text":"Hi","start":0,"end":2,"id":0,"ws":false},{"text":".","start":2,"end":3,"id":1,"ws":true},{"text":"I","start":4,"end":5,"id":2,"ws":false},{"text":"\u2019m","start":5,"end":7,"id":3,"ws":true},{"text":"Vincent","start":8,"end":15,"id":4,"ws":false},{"text":".","start":15,"end":16,"id":5,"ws":false}],"_view_id":"spans_manual","spans":[{"start":8,"end":15,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1658923184}
{"text":"Hey guys. I\u2019m Vincent.","_input_hash":-421603857,"_task_hash":1676229821,"tokens":[{"text":"Hey","start":0,"end":3,"id":0,"ws":true},{"text":"guys","start":4,"end":8,"id":1,"ws":false},{"text":".","start":8,"end":9,"id":2,"ws":true},{"text":"I","start":10,"end":11,"id":3,"ws":false},{"text":"\u2019m","start":11,"end":13,"id":4,"ws":true},{"text":"Vincent","start":14,"end":21,"id":5,"ws":false},{"text":".","start":21,"end":22,"id":6,"ws":false}],"_view_id":"spans_manual","spans":[{"start":14,"end":21,"token_start":5,"token_end":5,"label":"name"}],"answer":"accept","_timestamp":1658923186}
{"text":"I\u2019m Vincent.","_input_hash":-1167920099,"_task_hash":-1777910724,"tokens":[{"text":"I","start":0,"end":1,"id":0,"ws":false},{"text":"\u2019m","start":1,"end":3,"id":1,"ws":true},{"text":"Vincent","start":4,"end":11,"id":2,"ws":false},{"text":".","start":11,"end":12,"id":3,"ws":false}],"_view_id":"spans_manual","spans":[{"start":4,"end":11,"token_start":2,"token_end":2,"label":"name"}],"answer":"accept","_timestamp":1658923187}
{"text":"Um, hey guys. I\u2019m Vincent.","_input_hash":-1621557865,"_task_hash":-1088157028,"tokens":[{"text":"Um","start":0,"end":2,"id":0,"ws":false},{"text":",","start":2,"end":3,"id":1,"ws":true},{"text":"hey","start":4,"end":7,"id":2,"ws":true},{"text":"guys","start":8,"end":12,"id":3,"ws":false},{"text":".","start":12,"end":13,"id":4,"ws":true},{"text":"I","start":14,"end":15,"id":5,"ws":false},{"text":"\u2019m","start":15,"end":17,"id":6,"ws":true},{"text":"Vincent","start":18,"end":25,"id":7,"ws":false},{"text":".","start":25,"end":26,"id":8,"ws":false}],"_view_id":"spans_manual","spans":[{"start":18,"end":25,"token_start":7,"token_end":7,"label":"name"}],"answer":"accept","_timestamp":1658923188}
{"text":"Hi everyone. I\u2019m Vincent.","_input_hash":1533256870,"_task_hash":1795565040,"tokens":[{"text":"Hi","start":0,"end":2,"id":0,"ws":true},{"text":"everyone","start":3,"end":11,"id":1,"ws":false},{"text":".","start":11,"end":12,"id":2,"ws":true},{"text":"I","start":13,"end":14,"id":3,"ws":false},{"text":"\u2019m","start":14,"end":16,"id":4,"ws":true},{"text":"Vincent","start":17,"end":24,"id":5,"ws":false},{"text":".","start":24,"end":25,"id":6,"ws":false}],"_view_id":"spans_manual","spans":[{"start":17,"end":24,"token_start":5,"token_end":5,"label":"name"}],"answer":"accept","_timestamp":1658923190}
{"text":"Hey, I\u2019m Vincent.","_input_hash":-969710892,"_task_hash":296297103,"tokens":[{"text":"Hey","start":0,"end":3,"id":0,"ws":false},{"text":",","start":3,"end":4,"id":1,"ws":true},{"text":"I","start":5,"end":6,"id":2,"ws":false},{"text":"\u2019m","start":6,"end":8,"id":3,"ws":true},{"text":"Vincent","start":9,"end":16,"id":4,"ws":false},{"text":".","start":16,"end":17,"id":5,"ws":false}],"_view_id":"spans_manual","spans":[{"start":9,"end":16,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1658923191}
{"text":"Hi, I\u2019m Vincent.","_input_hash":1492765203,"_task_hash":1032281970,"tokens":[{"text":"Hi","start":0,"end":2,"id":0,"ws":false},{"text":",","start":2,"end":3,"id":1,"ws":true},{"text":"I","start":4,"end":5,"id":2,"ws":false},{"text":"\u2019m","start":5,"end":7,"id":3,"ws":true},{"text":"Vincent","start":8,"end":15,"id":4,"ws":false},{"text":".","start":15,"end":16,"id":5,"ws":false}],"_view_id":"spans_manual","spans":[{"start":8,"end":15,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1658923191}
{"text":"Hi, I am Vincent.","_input_hash":-1682537876,"_task_hash":-2016748550,"tokens":[{"text":"Hi","start":0,"end":2,"id":0,"ws":false},{"text":",","start":2,"end":3,"id":1,"ws":true},{"text":"I","start":4,"end":5,"id":2,"ws":true},{"text":"am","start":6,"end":8,"id":3,"ws":true},{"text":"Vincent","start":9,"end":16,"id":4,"ws":false},{"text":".","start":16,"end":17,"id":5,"ws":false}],"_view_id":"spans_manual","spans":[{"start":9,"end":16,"token_start":4,"token_end":4,"label":"name"}],"answer":"accept","_timestamp":1658923219}
{"text":"Hi, the name is Vincent.","_input_hash":222736874,"_task_hash":1101971754,"tokens":[{"text":"Hi","start":0,"end":2,"id":0,"ws":false},{"text":",","start":2,"end":3,"id":1,"ws":true},{"text":"the","start":4,"end":7,"id":2,"ws":true},{"text":"name","start":8,"end":12,"id":3,"ws":true},{"text":"is","start":13,"end":15,"id":4,"ws":true},{"text":"Vincent","start":16,"end":23,"id":5,"ws":false},{"text":".","start":23,"end":24,"id":6,"ws":false}],"_view_id":"spans_manual","spans":[{"start":16,"end":23,"token_start":5,"token_end":5,"label":"name"}],"answer":"accept","_timestamp":1658923221}

You can save these examples in a file called names-example.jsonl which you can then feed to Prodigy.

python -m prodigy db-in names-examples names-example.jsonl

This is what I see when I train a model via Prodigy.

python -m prodigy train --spancat names-examples

When I run this, I can confirm that spancat is being trained as I'd expect.

E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ------------  ----------  ----------  ----------  ------
  0       0          0.32         15.24       66.67       50.00      100.00    0.67
200     200          0.20          5.38      100.00      100.00      100.00    1.00
400     400          0.00          0.00      100.00      100.00      100.00    1.00
600     600          0.00          0.00      100.00      100.00      100.00    1.00
800     800          0.00          0.00      100.00      100.00      100.00    1.00
1000    1000          0.00          0.00      100.00      100.00      100.00    1.00
1200    1200          0.00          0.00      100.00      100.00      100.00    1.00
1400    1400          0.00          0.00      100.00      100.00      100.00    1.00
1600    1600          0.00          0.00      100.00      100.00      100.00    1.00
1800    1800          0.00          0.00      100.00      100.00      100.00    1.00

Could you run this on your machine? If this runs normally, then we have evidence that there is an issue with your dataset instead.

Topic		Replies	Views
Spancat Give scores 0	5	527	February 20, 2024
Span Cat Annotations and Incorrect Predictions spacy , spancat	4	844	June 8, 2023
Unable to use train and run data-to-spacy recipes for spancat on prodigy 1.11.10 solved , spancat	4	875	May 4, 2023
impact of percentage of evaluation data on performance spacy , spancat	9	944	December 13, 2022
Spacy NER - tokeniser for camembert-base ner	17	1143	March 15, 2023

Spancat is not trained

Related topics