SpanCat Training Error on Custom Preprocessed Dataset

Hi :raised_hand:

I'm new to machine learning and have been working with a dataset that I annotated using Prodigy. I trained a model using Prodigy's CLI training, and everything ran smoothly.

However, I recently attempted to preprocess the dataset by applying some additional steps that altered the data. While there were no issues saving the preprocessed data to the Prodigy database, I encountered errors when trying to train the model using the following command:

python -m prodigy train ./training/spancat/test --spancat test --eval-split 0.25

The error message I received was:

:warning: Aborting and saving the final best model. Encountered exception:
ValueError('all sequence lengths must be >= 0')

I've attached links to the annotated and preprocessed dataset samples for reference. I'm hoping to get some advice on how to resolve these errors and improve the performance of my model with the preprocessed data.

Any insights or guidance would be greatly appreciated. Thanks in advance for your help!

Annotated data: Running normally · GitHub
Preprocessed data: Preproccesing not working on prodigy model training cli · GitHub

hi @daffahilmyf!

Thanks for your question and welcome to the Prodigy community :wave:

Thanks as well for providing your examples. This helped a lot!

I noticed that in your updated file it looks like you stemmed and removed stop words. Is this correct?

  • Original (worked): "text": "In order to fulfil the requirements of some railways, it should be possible to provide an alternative means of link assurance indication."
  • New (didn't work): "text": "in order fulfil requir some railway possibl provid altern mean link assur indic"

Can you explain more on why the pre-processing is needed?

There's no need to do pre-processing like this, and we generally recommend against it. For example, this is a good post on the background of stop words:
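As an aside, if your experiments do need lemmas or stop-word filtering, a less destructive option is to keep the raw text unchanged and read those signals off spaCy's token attributes. A minimal sketch (illustrative only; en_core_web_sm is just an example pipeline, not something Prodigy requires):

import spacy

# en_core_web_sm is just an example pipeline; any English model with a
# lemmatizer works, or use spacy.blank("en") if you only need stop-word flags.
nlp = spacy.load("en_core_web_sm")

doc = nlp(
    "In order to fulfil the requirements of some railways, it should be possible "
    "to provide an alternative means of link assurance indication."
)

# Stop-word flags and lemmas are available per token, so the raw text
# (and therefore the annotated span offsets) never has to change.
content_lemmas = [t.lemma_ for t in doc if not t.is_stop and not t.is_punct]
print(content_lemmas)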


Thanks for your response. I really appreciate it! @ryanwesslen

I asked for help on the spaCy GitHub discussions and was able to find a solution to my problem. However, now I am encountering a new error :sweat_smile:

We've seen this error in the past when there was a bug related to docs without any suggestions, but this should be fixed in spacy v3.3.2 and v3.4.4

After downgrading the version, it started working. However, I am now facing a new error like this

Yes, that's correct.

Can you explain more on why the pre-processing is needed?

I am currently working on two research projects that focus on classification and named entity recognition (or more precisely, span categorization). I have followed suggestions from previous papers and guidance from my professor, which involves pre-processing the data to experiment with different models and determine the best model for evaluation.

Initially, I believed that experimenting with models on preprocessed data was common practice in NLP research. Thank you, though, for providing the forum post for me to read.

I would love to continue this discussion further if possible. Thank you :slightly_smiling_face:

Thanks for your response.

Were your annotations altered with any pre- or post-processing? You should avoid any modifications to your annotations if you want to use prodigy train.

That message is a bit too vague for me to diagnose without more details. Was this all of the error message? If not, can you provide the full stack error message?

The closest I found was relating to tokenization:

I'm wondering if this is a tokenization problem because of some pre- or post-processing you may have done.
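One rough way to check this yourself (just a sketch; the file name is a placeholder, and it assumes the character offsets in "spans" refer to the "text" field):

import json
import spacy

# Blank English tokenizer; swap in whatever tokenizer/base model you train with.
nlp = spacy.blank("en")

# Placeholder path to your exported Prodigy JSONL.
with open("preprocessed.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        doc = nlp.make_doc(example["text"])
        for span in example.get("spans", []):
            # char_span returns None when the character offsets don't line up
            # with the token boundaries produced by this tokenizer.
            if doc.char_span(span["start"], span["end"]) is None:
                print(f"record {i}: span {span['start']}-{span['end']} is misaligned")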

Alternatively, if not, can you provide a small sample of your data like you did previously?

Also, moving forward, please avoid screenshots of code - you can instead copy/paste it directly. This enables it to be searchable for the next user (e.g., now others could search for the same error message and find this post) :slight_smile:

Hi @ryanwesslen, thank you for your response. I apologize for my delayed reply.

Sure, here are the two errors I noticed (in two different preprocessed datasets):

=========================== Initializing pipeline ===========================
[2023-03-06 22:58:53,473] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 230 | Evaluation: 75 (25% split)
Training: 224 | Evaluation: 75
Labels: spancat (5)
[2023-03-06 22:58:53,511] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-03-06 22:58:53,513] [INFO] Created vocabulary
[2023-03-06 22:58:53,514] [INFO] Finished initializing nlp object
[2023-03-06 22:58:53,631] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 230 | Evaluation: 75 (25% split)
Training: 224 | Evaluation: 75
Labels: spancat (5)
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  ------------  ------------  ----------  ----------  ----------  ------
⚠ Aborting and saving the final best model. Encountered exception:
IndexError('index -7099 is out of bounds for axis 0 with size 7096')
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\prodigy\__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 379, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\prodigy\recipes\train.py", line 289, in train
    return _train(
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\prodigy\recipes\train.py", line 209, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\training\loop.py", line 122, in train
    raise e
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\training\loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\training\loop.py", line 203, in train_while_improving
    nlp.update(
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1164, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name])  # type: ignore
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\spancat.py", line 347, in update
    scores, backprop_scores = self.model.begin_update((docs, spans))
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 309, in begin_update
    return self._func(self, X, is_train=True)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\concatenate.py", line 44, in forward
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\concatenate.py", line 44, in <listcomp>
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\reduce_last.py", line 18, in forward
    Y = cast(OutT, Xr.dataXd[ends]) # type: ignore
IndexError: index -7099 is out of bounds for axis 0 with size 7096

and

=========================== Initializing pipeline ===========================
[2023-03-06 22:57:40,994] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 3 | Evaluation: 0 (25% split)
Training: 3 | Evaluation: 0
Labels: spancat (3)
[2023-03-06 22:57:41,005] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-03-06 22:57:41,007] [INFO] Created vocabulary
[2023-03-06 22:57:41,008] [INFO] Finished initializing nlp object
[2023-03-06 22:57:41,039] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline

============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
  - [spancat] Training: 3 | Evaluation: 0 (25% split)
Training: 3 | Evaluation: 0
Labels: spancat (3)
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  ------------  ------------  ----------  ----------  ----------  ------
Segmentation fault

Yes, I altered the annotations with preprocessing steps. After discussing with my professor, we concluded that we still need to perform preprocessing steps based on our research hypothesis. Is it still possible to train the model with preprocessed datasets?

Sure, I can provide one of my custom preprocessed datasets.

datasets: Preprocessing datasets · GitHub

Thanks for the update (and especially the data!).

You seem to have some spans that have a "negative length". I think you may have done something unintended in post-processing (I don't think pre-processing alone would cause this).

So I first loaded the data (db-in), then exported it to spaCy binary files (data-to-spacy, outputting locally to a folder called data/issue-6405, but you can use any path you like).
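For reference, those two steps look roughly like this (the dataset name issue-6405 and the input file name are placeholders, not necessarily what you'd use):

$ python -m prodigy db-in issue-6405 ./preprocessed.jsonl
$ python -m prodigy data-to-spacy ./data/issue-6405 --spancat issue-6405 --eval-split 0.25

Then I ran: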

$ python -m spacy debug data data/issue-6405/config.cfg

============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable

=============================== Training stats ===============================
Language: en
Training pipeline: tok2vec, spancat
239 training docs
60 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (239)

============================== Vocab & Vectors ==============================
ℹ 3194 total word(s) in the data (1155 unique)
ℹ No word vectors present in the package

============================ Span Categorization ============================

Spans Key   Labels                        
---------   ------------------------------
sc          {'POSTCONDITION', 'ACTION', 'ACTOR', 'PRECONDITION', 'QUALITY'}

⚠ Low number of examples for label 'QUALITY' in key 'sc' (30)
⚠ Low number of examples for label 'ACTION' in key 'sc' (1)
ℹ Span characteristics for spans_key 'sc'
ℹ SD = Span Distinctiveness, BD = Boundary Distinctiveness

Span Type       Length     SD     BD     N
-------------   ------   ----   ----   ---
PRECONDITION      3.96   1.45   2.30    61
ACTOR             1.50   1.30   1.35   290
POSTCONDITION     9.18   0.08   1.99   258
QUALITY           5.24   2.26   2.54    30
ACTION            3.00   5.58   5.88     1
-------------   ------   ----   ----   ---
Wgt. Average      5.01   0.88   1.76     -

ℹ Over 90% of spans have lengths of 1 -- 12 (min=1, max=26). The most
common span lengths are: 1 (24.22%), 2 (17.97%), 3 (8.44%), 4 (3.28%), 5
(3.91%), 6 (3.75%), 7 (6.88%), 8 (6.41%), 9 (5.0%), 10 (3.59%), 11 (3.91%), 12
(2.81%). If you are using the n-gram suggester, note that omitting infrequent
n-gram lengths can greatly improve speed and memory usage.
⚠ Spans may not be distinct from the rest of the corpus
✔ Boundary tokens are distinct from the rest of the corpus
✔ Examples without ocurrences available for all labels

================================== Summary ==================================
✔ 5 checks passed
⚠ 4 warnings

Overall, nothing stood out as a major problem. I did notice you only had 1 example of 'ACTION'. You should likely exclude that record (e.g., if it goes into dev, how can the model train for it? Or if it's in training, how can the model evaluate it?). When I did that, it didn't remove the error at hand. Also, be a little cautious around the very long spans (e.g., the largest is 26 tokens long); as the warning mentions, this can slow down your training.
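If it helps, filtering out records that contain a given label before db-in could look roughly like this (file names are placeholders, and dropping the whole record rather than just the span is an assumption on my part):

import json

LABEL_TO_DROP = "ACTION"

# Placeholder file names; adjust to your own export.
with open("preprocessed.jsonl", encoding="utf8") as src, \
        open("preprocessed_filtered.jsonl", "w", encoding="utf8") as dst:
    for line in src:
        example = json.loads(line)
        # Skip any record that contains the rare label; alternatively you
        # could drop only that span and keep the rest of the record.
        if any(span["label"] == LABEL_TO_DROP for span in example.get("spans", [])):
            continue
        dst.write(json.dumps(example) + "\n")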

Instead, I tried a quick method: start with the first record and keep adding records to find how many will train. I found I could train on the first few records, but the sixth and eleventh records have issues.

6th: "automatic network selection implemented shall possible driver activate deactivate automatic network selection"

11th: "call fails lead traction vehicle lead driver responsible establishing call call fails cab driver cab call lead cab request establishment call"

Nothing stood out; however, I did notice that these records have "negative spans", i.e., at least one of their spans has a "token_end" that is less than its "token_start".

For example:

{
  "text": "call fails lead traction vehicle lead driver responsible establishing call call fails cab driver cab call lead cab request establishment call",
  "spans": [
    {
      "start": 0,
      "end": 32,
      "token_start": 0,
      "token_end": 4,
      "label": "PRECONDITION"
    },
    {
      "start": 33,
      "end": 74,
      "token_start": 2,
      "token_end": 0, # notice this would suggest a "negative" span length
      "label": "POSTCONDITION"
    },
    {
      "start": 75,
      "end": 89,
      "token_start": 0,
      "token_end": 12,
      "label": "PRECONDITION"
    },
    {
      "start": 90,
      "end": 141,
      "token_start": 6,
      "token_end": 0, # another "negative" span length
      "label": "POSTCONDITION"
    }
  ],
  ...
}
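For reference, this is roughly how such spans can be found in a JSONL export (a sketch; the file path is a placeholder):

import json

# Placeholder path to the exported JSONL file.
with open("preprocessed.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        for span in example.get("spans", []):
            # A well-formed span should satisfy token_start <= token_end.
            if span.get("token_end", 0) < span.get("token_start", 0):
                print(f"record {i}: negative-length span {span}")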

Any idea what could have caused this? Did you modify the "token_end"?

Hi @ryanwesslen, thank you for the answer. I now understand the problem I was facing.

Yes, I forgot to remove the "ACTION" label. After discussing it with my annotator, we decided to remove it, as the action is already included in the precondition and postcondition.

Yes, I changed the token_start and token_end values because adding or removing stop words changes the token offsets. I wrote some code to handle this, and it appears that the issue is within that code itself.

This is my code: Preprocessing Function

Perhaps the issue lies with the recount_spans and get_token_index functions. I'll try to fix them.
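One idea I want to try is to stop tracking token indices by hand and instead recompute token_start/token_end from the character offsets with spaCy. A rough sketch (it assumes the character "start"/"end" values are still correct after preprocessing; this is not my actual recount_spans code):

import spacy

nlp = spacy.blank("en")

def recompute_token_offsets(example):
    """Recompute token_start/token_end for each span from its character offsets.

    Sketch only: assumes example is a Prodigy task dict whose spans carry
    correct character "start"/"end" offsets into example["text"].
    """
    doc = nlp.make_doc(example["text"])
    for span in example.get("spans", []):
        snapped = doc.char_span(span["start"], span["end"], alignment_mode="expand")
        if snapped is None:
            continue  # offsets fall outside the text; needs manual review
        span["token_start"] = snapped.start
        span["token_end"] = snapped.end - 1  # Prodigy's token_end is inclusive
    return example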

I can't thank you enough for helping me with this matter :bowing_man: @ryanwesslen