Spancat training from db-in'd dataset not working

DGMS90 · April 13, 2022, 7:47am

Hey

I have used Prodigy (the ner.manual and ner.correct recipes) to produce an annotated NER/span dataset (extracting qualification entities from job descriptions). Having given the annotation strategy some further thought, I would like to subsume one of the classes under another: a class I thought was useful at the outset is actually not so useful, and those instances would be better labelled as one of the other, more common classes.

I used the db-out recipe, opened the .jsonl file and search-replaced the original label with the one I'd now like to have. I then used the db-in recipe on this file to create a corrected dataset, which works correctly and has the correct number of instances.
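For readers following the same workflow, a scripted relabel is one alternative to a text-editor search-and-replace, since it only touches the label field of each span. The snippet below is a minimal sketch, assuming the usual Prodigy export layout in which each line is a JSON object with a "spans" list of dicts carrying a "label" key. The function name `relabel_spans` and the sample sentence are made up for illustration.

```python
import json


def relabel_spans(lines, old_label, new_label):
    """Yield JSON lines with span labels renamed; all other fields are left as they are."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        # Assumption: annotations live under "spans", each a dict with a "label" key.
        for span in example.get("spans", []):
            if span.get("label") == old_label:
                span["label"] = new_label
        yield json.dumps(example, ensure_ascii=False)


# Tiny in-memory demonstration with a made-up record.
sample = [
    '{"text": "ACCA qualified", "spans": [{"start": 0, "end": 4, '
    '"label": "QUAL_CERTIFICATION"}], "answer": "accept"}'
]
relabelled = list(relabel_spans(sample, "QUAL_CERTIFICATION", "QUAL_PROFESSIONAL"))
print(relabelled[0])
```

To apply this to a real export, pass the opened db-out file as `lines` and write each yielded line to a new .jsonl file before running db-in on it.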

However, when I then try to train a spancat model on this dataset, I receive the following error trace:

Auto-generating config with spaCy
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.2288.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.2288.0_x64__qbz5n2kfra8p0\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 329, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\train.py", line 261, in train
    train_config = prodigy_config(
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\train.py", line 112, in prodigy_config
    config = generate_default_config(pipes, lang, base_nlp, silent=silent)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\train.py", line 571, in generate_default_config
    suggester = infer_spancat_suggester(examples, nlp)
  File "C:\Projects\PROJECT-NAME\prodigy_env\lib\site-packages\prodigy\recipes\data_utils.py", line 959, in infer_spancat_suggester
    char_span = doc.char_span(span["start"], span["end"])
TypeError: string indices must be integers

All I can think is that I've somehow altered the input data, but I can't work out how I might have done this.

I'd really appreciate any help.

koaning · April 13, 2022, 11:02am

Hi Darren!

Just so I understand correctly, did you edit the .jsonl file by hand, or did you use a script?

It's possible that a typo is causing the issue here. Do you have a single example that triggers this error?
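For context, Python raises "TypeError: string indices must be integers" when a string is indexed with a string key, so at `span["start"]` the value of `span` was a string rather than a dict. One way to hunt for such a record is to scan the exported .jsonl file for spans that are not dicts. This is a rough sketch under the assumption that spans are stored as a list of dicts with integer "start" and "end" keys. The helper name `find_malformed_spans` and the sample records are hypothetical.

```python
import json


def find_malformed_spans(lines):
    """Return (line_number, problem) pairs for records whose spans look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(example, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        spans = example.get("spans", [])
        if not isinstance(spans, list):
            # Iterating over a dict would yield its string keys.
            problems.append((number, "'spans' is not a list"))
            continue
        for span in spans:
            if not isinstance(span, dict):
                problems.append((number, f"span is {type(span).__name__}, expected dict"))
            elif not all(isinstance(span.get(key), int) for key in ("start", "end")):
                problems.append((number, "span lacks integer 'start'/'end'"))
    return problems


# Made-up records: the second one stores a bare string where a span dict belongs.
sample = [
    '{"text": "BSc in Maths", "spans": [{"start": 0, "end": 3, "label": "QUAL_SUBJECT"}]}',
    '{"text": "BSc in Maths", "spans": ["QUAL_SUBJECT"]}',
]
print(find_malformed_spans(sample))
```

An empty result would suggest the file itself is well formed and that the problem lies elsewhere, for example in older records already stored in the dataset.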

DGMS90 · April 13, 2022, 2:17pm

Hey Vincent,

Thanks for the reply.

I edited the .jsonl file produced by db-out by opening it in Notepad++ and using search-and-replace to swap the label I wanted to remove for the label I wanted to use instead. I then saved the file and imported it again using db-in.

I have attached a file containing two example sentences with annotations. The first is copied straight from the original file. The second is the same example with the change made (QUAL_CERTIFICATION to QUAL_PROFESSIONAL). I then saved the file and created a dataset using the db-in command, and it is this dataset that I subsequently had issues building a spancat model with.

Hope this is enough info.

example_for_prodigy_support.jsonl (3.4 KB)

koaning · April 13, 2022, 4:54pm

I saved your examples in a file called spancats.jsonl. It seems to load fine on my machine when I open the file in prodigy via the spans.manual recipe.

prodigy spans.manual spancat-demo en_core_web_md spancats.jsonl --label QUAL_PROFESSIONAL,QUAL_CERTIFICATION,QUAL_SUBJECT

It also seems to run fine when I run:

> prodigy db-in demo spancats.jsonl 
✔ Created dataset 'demo' in database SQLite
✔ Imported 2 annotations to 'demo' (session 2022-04-13_18-53-21) in
database SQLite
Found and keeping existing "answer" in 2 examples

Just to confirm, are these the commands that you ran?

DGMS90 · April 14, 2022, 8:58am

I used the db-in command and got the same response as you.

The error I'm having occurs when I use the train --spancat command on the dataset created using db-in.

koaning · April 14, 2022, 3:55pm

It seems to run just fine on my machine when I run:

python -m prodigy train --spancat spancat-demo

The thing here is that the train command doesn't read the .jsonl file. It reads the data directly from the Prodigy database. So I labeled a single example from what you shared and got this output:

ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 2 to 3 (inferred from data)
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-04-14 17:54:35,567] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (3)
[2022-04-14 17:54:35,577] [INFO] Pipeline: ['spancat']
[2022-04-14 17:54:35,579] [INFO] Created vocabulary
[2022-04-14 17:54:35,580] [INFO] Finished initializing nlp object
[2022-04-14 17:54:35,609] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (3)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0         44.47        0.00        0.00        0.00    0.00
200     200         54.13        0.00        0.00        0.00    0.00
400     400          0.08        0.00        0.00        0.00    0.00
600     600          0.05        0.00        0.00        0.00    0.00
800     800          0.03        0.00        0.00        0.00    0.00
1000    1000          0.02        0.00        0.00        0.00    0.00
1200    1200          0.02        0.00        0.00        0.00    0.00
1400    1400          0.01        0.00        0.00        0.00    0.00
1600    1600          0.01        0.00        0.00        0.00    0.00

Note that spancat-demo is the name of the dataset in Prodigy.
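Because train reads from the database rather than from the .jsonl file, one way to check what it will actually see is to export the dataset again with db-out and count the span labels in that export. The snippet below is only a sketch, assuming the same "spans"/"label" layout as in the shared examples. The function name `count_span_labels` and the sample records are made up.

```python
import json
from collections import Counter


def count_span_labels(lines):
    """Count how often each span label occurs across JSONL records."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        example = json.loads(line)
        for span in example.get("spans", []):
            # Only well-formed spans (dicts with a label) are counted.
            if isinstance(span, dict) and "label" in span:
                counts[span["label"]] += 1
    return counts


# Made-up records standing in for a db-out export.
sample = [
    '{"text": "ACCA and BSc", "spans": [{"start": 0, "end": 4, "label": "QUAL_PROFESSIONAL"}, '
    '{"start": 9, "end": 12, "label": "QUAL_SUBJECT"}]}',
    '{"text": "CIMA", "spans": [{"start": 0, "end": 4, "label": "QUAL_PROFESSIONAL"}]}',
]
print(count_span_labels(sample))
```

If a label that was meant to be replaced still shows up in the counts, the dataset in the database probably still contains the old annotations.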

koaning · April 14, 2022, 4:02pm

Oh wait. I think I'm doing something different. I'm not using db-in directly here because I'd added the examples via labeling. Let me try that real quick!

koaning · April 14, 2022, 4:05pm

I think it still works on my machine though. Here I am creating a dataset called foobar.

> prodigy db-in foobar spancats.jsonl 
✔ Created dataset 'foobar' in database SQLite
✔ Imported 2 annotations to 'foobar' (session 2022-04-14_18-03-10) in
database SQLite
Found and keeping existing "answer" in 2 examples

And here I seem to be training spancat just fine.

> python -m prodigy train --spancat foobar             
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 2 to 3 (inferred from data)
✔ Generated training config

=========================== Initializing pipeline ===========================
[2022-04-14 18:03:18,980] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 2 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (4)
[2022-04-14 18:03:18,990] [INFO] Pipeline: ['spancat']
[2022-04-14 18:03:18,992] [INFO] Created vocabulary
[2022-04-14 18:03:18,993] [INFO] Finished initializing nlp object
[2022-04-14 18:03:19,017] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 2 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (4)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE 
---  ------  ------------  ----------  ----------  ----------  ------
  0       0         63.69        0.00        0.00        0.00    0.00
200     200         96.16        0.00        0.00        0.00    0.00
400     400          0.14        0.00        0.00        0.00    0.00
600     600          0.07        0.00        0.00        0.00    0.00
800     800          0.05        0.00        0.00        0.00    0.00
1000    1000          0.04        0.00        0.00        0.00    0.00
1200    1200          0.03        0.00        0.00        0.00    0.00
1400    1400          0.02        0.00        0.00        0.00    0.00
1600    1600          0.02        0.00        0.00        0.00    0.00

Can you confirm whether you see something different when you repeat these two exact steps?

DGMS90 · April 22, 2022, 10:38am

Sorry for the delay - just coming back to this project.

For some reason this is working now. I can only think that restarting my machine since I last tried this has made a difference. Sorry I can't find a better explanation.

Thanks for your help, really appreciated.
