Hi @ryanwesslen, thank you for your response. I apologize for my delayed reply.
Sure. Here are the two errors I have run into (each occurring with a different preprocessed dataset):
=========================== Initializing pipeline ===========================
[2023-03-06 22:58:53,473] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 230 | Evaluation: 75 (25% split)
Training: 224 | Evaluation: 75
Labels: spancat (5)
[2023-03-06 22:58:53,511] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-03-06 22:58:53,513] [INFO] Created vocabulary
[2023-03-06 22:58:53,514] [INFO] Finished initializing nlp object
[2023-03-06 22:58:53,631] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 230 | Evaluation: 75 (25% split)
Training: 224 | Evaluation: 75
Labels: spancat (5)
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS SPANCAT SPANS_SC_F SPANS_SC_P SPANS_SC_R SCORE
--- ------ ------------ ------------ ---------- ---------- ---------- ------
⚠ Aborting and saving the final best model. Encountered exception:
IndexError('index -7099 is out of bounds for axis 0 with size 7096')
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\prodigy\__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 379, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\prodigy\recipes\train.py", line 289, in train
    return _train(
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\prodigy\recipes\train.py", line 209, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\training\loop.py", line 122, in train
    raise e
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\training\loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\training\loop.py", line 203, in train_while_improving
    nlp.update(
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1164, in update
    proc.update(examples, sgd=None, losses=losses, **component_cfg[name]) # type: ignore
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\spancat.py", line 347, in update
    scores, backprop_scores = self.model.begin_update((docs, spans))
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 309, in begin_update
    return self._func(self, X, is_train=True)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\concatenate.py", line 44, in forward
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\concatenate.py", line 44, in <listcomp>
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "C:\Users\daffa\AppData\Roaming\Python\Python39\site-packages\thinc\layers\reduce_last.py", line 18, in forward
    Y = cast(OutT, Xr.dataXd[ends]) # type: ignore
IndexError: index -7099 is out of bounds for axis 0 with size 7096
and
=========================== Initializing pipeline ===========================
[2023-03-06 22:57:40,994] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 3 | Evaluation: 0 (25% split)
Training: 3 | Evaluation: 0
Labels: spancat (3)
[2023-03-06 22:57:41,005] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-03-06 22:57:41,007] [INFO] Created vocabulary
[2023-03-06 22:57:41,008] [INFO] Finished initializing nlp object
[2023-03-06 22:57:41,039] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 3 | Evaluation: 0 (25% split)
Training: 3 | Evaluation: 0
Labels: spancat (3)
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS SPANCAT SPANS_SC_F SPANS_SC_P SPANS_SC_R SCORE
--- ------ ------------ ------------ ---------- ---------- ---------- ------
Segmentation fault
Yes, I altered the annotations with preprocessing steps. After discussing it with my professor, we concluded that the preprocessing is still required by our research hypothesis. Is it still possible to train the model on the preprocessed datasets?
Sure, I can provide one of my custom preprocessed datasets:
datasets: Preprocessing datasets · GitHub
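In case it helps with debugging, here is the kind of sanity check I could run over the preprocessed JSONL to see whether my preprocessing changed the texts without updating the span offsets (which would match the out-of-bounds index in the first traceback). This is only a minimal sketch: it assumes Prodigy-style JSONL with a "text" field and "spans" carrying character offsets ("start"/"end"), a blank English tokenizer, and a placeholder file name; the key names and language would need to match my actual export.

import json
import spacy

# Assumption: the dataset is English; swap in the actual language code.
nlp = spacy.blank("en")

def check_spans(path):
    # Flag spans whose character offsets fall outside the text, or that
    # don't align with token boundaries. Either can produce indices past
    # the end of the document during spancat training.
    with open(path, encoding="utf8") as f:
        for line_no, line in enumerate(f, 1):
            example = json.loads(line)
            text = example.get("text", "")
            doc = nlp.make_doc(text)
            for span in example.get("spans", []):
                start, end = span["start"], span["end"]
                if start < 0 or start >= end or end > len(text):
                    print(f"line {line_no}: offsets ({start}, {end}) out of "
                          f"bounds for text of length {len(text)}")
                elif doc.char_span(start, end) is None:
                    print(f"line {line_no}: offsets ({start}, {end}) do not "
                          f"align with token boundaries: {text[start:end]!r}")

check_spans("preprocessed_dataset.jsonl")  # placeholder path

If the preprocessing shifted the text after the spans were annotated, I would expect this to flag the mismatched examples.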