Hey
Is there any way to train with the entire dataset? Isn't this the recommended way to make sure you include all annotations and don't lose anything to the eval split? (This, of course, after you're done evaluating architectures/params.)
The best workaround I've found is to use a low split (like 0.1), but I can't set it to 0:
```
Using 186 train / 0 eval (split 0%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Traceback (most recent call last):
  File "/home/cristian/anaconda3/envs/prodigy/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/cristian/anaconda3/envs/prodigy/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/cristian/anaconda3/envs/prodigy/lib/python3.6/site-packages/prodigy/__main__.py", line 60, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 213, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/cristian/anaconda3/envs/prodigy/lib/python3.6/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/home/cristian/anaconda3/envs/prodigy/lib/python3.6/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "train.py", line 138, in train
    baseline = nlp.evaluate(eval_data)
  File "/home/cristian/anaconda3/envs/prodigy/lib/python3.6/site-packages/spacy/language.py", line 677, in evaluate
    docs, golds = zip(*docs_golds)
ValueError: not enough values to unpack (expected 2, got 0)
```
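For what it's worth, the crash happens because `nlp.evaluate` is called on an empty eval set (`zip(*[])` yields nothing to unpack). A minimal sketch of a local workaround in the recipe, assuming the `nlp` and `eval_data` names from the traceback (the helper itself is hypothetical, not part of Prodigy or spaCy):

```python
# Hypothetical guard for train.py: skip the baseline evaluation when
# the eval split is empty (split=0), instead of crashing inside
# spacy's Language.evaluate.

def safe_evaluate(nlp, eval_data):
    """Return None instead of crashing when there is no eval data."""
    if not eval_data:
        # With an empty list, spacy's evaluate() hits
        # "ValueError: not enough values to unpack" -- avoid calling it
        return None
    return nlp.evaluate(eval_data)
```

Then line 138 would become `baseline = safe_evaluate(nlp, eval_data)`, and the later scoring code would also need to tolerate a `None` baseline.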
Thanks