Spancat training from db-in'd dataset not working

Hey :blush:

I have used Prodigy (the ner.manual and ner.correct recipes) to produce an annotated NER/span dataset (extracting qualification entities from job descriptions). Having given the annotation strategy some further thought, I would like to subsume one of the classes under another: a class I thought was useful at the outset is actually not so useful, and those instances would be better labelled as one of the other, more common classes.

I used the db-out recipe, opened the .jsonl file and search-replaced the original label with the one I'd now like to have. I then used the db-in recipe on this file to create a corrected dataset, which imports without errors and has the correct number of instances.
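For anyone repeating this relabelling step, a small script is a possible alternative to a text editor's search-and-replace, since it only touches the "label" field of each span and leaves the text alone. This is a minimal sketch, assuming each line of the db-out export is a JSON object with an optional "spans" list of dicts. The names relabel_line and relabel_file are hypothetical helpers, not part of Prodigy.

```python
import json


def relabel_line(line, old_label, new_label):
    """Return one JSONL line with span labels rewritten from old_label to new_label."""
    task = json.loads(line)
    # Assumption: "spans" is a list of dicts, each with a "label" key.
    for span in task.get("spans") or []:
        if span.get("label") == old_label:
            span["label"] = new_label
    return json.dumps(task, ensure_ascii=False)


def relabel_file(src_path, dst_path, old_label, new_label):
    """Copy a JSONL export to dst_path, rewriting span labels line by line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(relabel_line(line, old_label, new_label) + "\n")
```

With hypothetical file names, a call such as relabel_file("annotations.jsonl", "annotations_relabelled.jsonl", "QUAL_CERTIFICATION", "QUAL_PROFESSIONAL") would write a new file that can then be passed to db-in.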

However, when I then try to use this dataset to train a spancat model, I receive the following error trace:

Auto-generating config with spaCy
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.2288.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.2288.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\train.py", line 261, in train
    train_config = prodigy_config(
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\train.py", line 112, in prodigy_config
    config = generate_default_config(pipes, lang, base_nlp, silent=silent)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\train.py", line 571, in generate_default_config
    suggester = infer_spancat_suggester(examples, nlp)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\data_utils.py", line 959, in infer_spancat_suggester
    char_span = doc.char_span(span["start"], span["end"])
TypeError: string indices must be integers

All I can think is that I've somehow altered the input data, but I can't work out how I might have done this.

I'd really appreciate any help :pray:

Hi Darren!

Just so I understand correctly, did you edit the .jsonl file by hand, or did you use a script?

It's possible that a typo is causing the issue here. Do you have a single example that triggers this error?
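One way to look for such a typo is to scan the exported file for records whose spans are not in the shape the traceback expects. The error is raised on span["start"], and "string indices must be integers" is what Python reports when span is a string rather than a dict. The sketch below is only a guess at a useful check, assuming a JSONL export where "spans" should be a list of dicts with integer "start" and "end" values. find_bad_spans is a hypothetical helper, not a Prodigy function.

```python
import json


def find_bad_spans(lines):
    """Return (line_number, reason) pairs for records with malformed spans."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            task = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((lineno, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(task, dict):
            problems.append((lineno, "record is not a JSON object"))
            continue
        spans = task.get("spans", [])
        if not isinstance(spans, list):
            problems.append((lineno, "'spans' is not a list"))
            continue
        for span in spans:
            if not isinstance(span, dict):
                # This is the shape that would make span["start"] fail.
                problems.append((lineno, f"span is {type(span).__name__}, not a dict"))
            elif not all(isinstance(span.get(k), int) for k in ("start", "end")):
                problems.append((lineno, "span lacks integer 'start'/'end'"))
    return problems
```

Passing an open file object (for example the db-out export) as lines should list any offending line numbers. An empty list means the file at least has the expected span structure.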

Hey Vincent,

Thanks for the reply.

I edited the .jsonl file produced by db-out by opening it in Notepad++ and using search-and-replace to swap the label I wanted to retire for the label I wanted to use instead. I then saved the file and imported it again using db-in.

I have attached a file containing two example sentences with annotations. The first is copied straight from the original file. The second is the same example with the change made (QUAL_CERTIFICATION to QUAL_PROFESSIONAL). I then saved the file and created a dataset using the db-in command, and that is the dataset I subsequently had issues building a spancat model with.

Hope this is enough info.

example_for_prodigy_support.jsonl (3.4 KB)

I saved your examples in a file called spancats.jsonl. It seems to load fine on my machine when I open the file in Prodigy via the spans.manual recipe.

prodigy spans.manual spancat-demo en_core_web_md spancats.jsonl --label QUAL_PROFESSIONAL,QUAL_CERTIFICATION,QUAL_SUBJECT

It also seems to run fine when I run:

> prodigy db-in demo spancats.jsonl 
✔ Created dataset 'demo' in database SQLite
✔ Imported 2 annotations to 'demo' (session 2022-04-13_18-53-21) in
database SQLite
Found and keeping existing "answer" in 2 examples

Just to confirm, are these the commands that you ran?

I used the db-in command and got the same response as you.

The error I'm having occurs when I use the train --spancat command on the dataset created using db-in.

It seems to run just fine on my machine when I run:

python -m prodigy train --spancat spancat-demo

The thing here is that the train command doesn't read the .jsonl file. It picks up the data directly from the Prodigy database. So I labeled a single example from what you shared and got this output:

ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 2 to 3 (inferred from data)
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-04-14 17:54:35,567] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (3)
[2022-04-14 17:54:35,577] [INFO] Pipeline: ['spancat']
[2022-04-14 17:54:35,579] [INFO] Created vocabulary
[2022-04-14 17:54:35,580] [INFO] Finished initializing nlp object
[2022-04-14 17:54:35,609] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (3)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0         44.47        0.00        0.00        0.00    0.00
200     200         54.13        0.00        0.00        0.00    0.00
400     400          0.08        0.00        0.00        0.00    0.00
600     600          0.05        0.00        0.00        0.00    0.00
800     800          0.03        0.00        0.00        0.00    0.00
1000    1000          0.02        0.00        0.00        0.00    0.00
1200    1200          0.02        0.00        0.00        0.00    0.00
1400    1400          0.01        0.00        0.00        0.00    0.00
1600    1600          0.01        0.00        0.00        0.00    0.00

Note that spancat-demo is the name of the dataset in Prodigy.
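Since training reads from the database rather than from the file, it can help to check which labels a dataset actually contains. Here is a rough sketch that counts span labels over a list of annotation dicts, assuming each span is a dict with a "label" key. count_span_labels is a hypothetical helper. The commented-out loading lines assume the Prodigy v1 database API (connect and get_dataset) and may differ in other versions.

```python
from collections import Counter


def count_span_labels(examples):
    """Count how often each span label occurs across annotation dicts."""
    counts = Counter()
    for eg in examples:
        # Records without annotations may have no "spans" key at all.
        for span in eg.get("spans") or []:
            counts[span["label"]] += 1
    return counts


# Assumption: with Prodigy v1 installed, the records could be loaded like this:
#   from prodigy.components.db import connect
#   examples = connect().get_dataset("spancat-demo")
#   print(count_span_labels(examples))
```

If a label that was supposed to be replaced still shows up in the counts, the dataset being trained on is probably not the corrected one.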

Oh wait. I think I'm doing something different. I'm not using db-in directly here because I'd added the examples via labeling. Let me try that real quick!

I think it still works on my machine though. Here I am creating a dataset called foobar.

> prodigy db-in foobar spancats.jsonl 
✔ Created dataset 'foobar' in database SQLite
✔ Imported 2 annotations to 'foobar' (session 2022-04-14_18-03-10) in
database SQLite
Found and keeping existing "answer" in 2 examples

And here I seem to be training spancat just fine.

> python -m prodigy train --spancat foobar             
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 2 to 3 (inferred from data)
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-04-14 18:03:18,980] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 2 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (4)
[2022-04-14 18:03:18,990] [INFO] Pipeline: ['spancat']
[2022-04-14 18:03:18,992] [INFO] Created vocabulary
[2022-04-14 18:03:18,993] [INFO] Finished initializing nlp object
[2022-04-14 18:03:19,017] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 2 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (4)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0         63.69        0.00        0.00        0.00    0.00
200     200         96.16        0.00        0.00        0.00    0.00
400     400          0.14        0.00        0.00        0.00    0.00
600     600          0.07        0.00        0.00        0.00    0.00
800     800          0.05        0.00        0.00        0.00    0.00
1000    1000          0.04        0.00        0.00        0.00    0.00
1200    1200          0.03        0.00        0.00        0.00    0.00
1400    1400          0.02        0.00        0.00        0.00    0.00
1600    1600          0.02        0.00        0.00        0.00    0.00

If you repeat these two exact steps, can you confirm whether you see something different?


Sorry for the delay - just coming back to this project.

For some reason this is working now. I can only think that restarting my machine since I last tried this has made a difference. Sorry I can't find a better explanation :sweat_smile:

Thanks for your help, really appreciated :pray:
