Does terms.train need the pipeline if not merge_entities and not merge_nounchunks?

terms
spacy

(Madhu Jahagirdar) #1

I have been using Prodigy to train on 5 million document records, and at the current rate it will take 6-7 days, which is quite a long time. I was trying to optimize this a bit. When I disable all pipeline components except the sentencizer, performance is hugely better. My question is: if I am not using merge_entities or merge_nounchunks and my only use case right now is text classification, can I disable all pipeline components except the sentencizer, and does that impact the accuracy of the classifier?


(Matthew Honnibal) #2

If you’re not using merge_entities and you’re not using merge_nounchunks, then yes, you should disable the other pipeline components. We’ll look at doing this automatically.
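For reference, a minimal sketch of a sentencizer-only pipeline along these lines. It is written to run on both spaCy v2 and v3 (the thread predates v3's string-based add_pipe), and the example texts are illustrative:

```python
import spacy

# Start from a blank English pipeline: no tagger, parser or NER,
# so nothing beyond tokenization and sentence splitting runs.
nlp = spacy.blank("en")
try:
    nlp.add_pipe("sentencizer")                    # spaCy v3: factory name as string
except (ValueError, TypeError):
    nlp.add_pipe(nlp.create_pipe("sentencizer"))   # spaCy v2 fallback

doc = nlp("First sentence. Second sentence.")
sents = [sent.text for sent in doc.sents]
print(nlp.pipe_names)  # ['sentencizer']
print(sents)           # ['First sentence.', 'Second sentence.']
```

With only the rule-based sentencizer in the pipeline, no statistical models are loaded or run, which is where the large speed difference comes from.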


(Madhu Jahagirdar) #3

Should I disable the tagger (POS) as well?


(Matthew Honnibal) #4

Sure, yes.


(Madhu Jahagirdar) #5

Merged from thread Pipeline Sequence

@honnibal @ines

Is there any particular sequence in which I should add the pipeline components?
For a custom terms.train recipe I am adding them in the following sequence and getting an error:

 nlp = spacy.blank(lang)
 print("Using blank spaCy model ({})".format(lang))
 nlp.add_pipe(nlp.create_pipe('sentencizer'))
 nlp.add_pipe(nlp.create_pipe('tagger'))
 nlp.add_pipe(nlp.create_pipe('parser'))

Using blank spaCy model (en)
22:06:50 - 'pattern' package not found; tag filters are not available for English
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/prodigy/__main__.py", line 254, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 152, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/prodigy/recipes/terms.py", line 57, in train_vectors
    for doc in nlp.pipe((eg['text'] for eg in stream)):
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py", line 564, in pipe
    for doc in docs:
  File "pipeline.pyx", line 400, in pipe
  File "pipeline.pyx", line 405, in spacy.pipeline.Tagger.predict
AttributeError: 'bool' object has no attribute 'tok2vec'

Additionally, if I use only the sentencizer to create a word2vec model, and I don’t use merge_entities and merge_nounchunks in my textcat.batch-train, would that have any impact on classification performance? With all the pipeline components, 5 million documents take 5-7 days. Any help would be appreciated.


(Matthew Honnibal) #6

The terms.train-vectors recipe only trains word vectors. There’s no reason to add a blank tagger, parser or entity recognizer to the pipeline: they won’t be trained. The specific error you’re seeing above comes from not calling nlp.begin_training() to initialize the models; however, the real problem is that you shouldn’t be adding these components to the pipeline in this recipe.
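On the throughput question, streaming the texts through nlp.pipe with a sentencizer-only pipeline keeps per-document overhead minimal. A hedged sketch (the batch_size value and the generator of texts are illustrative, not part of the recipe):

```python
import spacy

# Sentencizer-only pipeline; the try/except covers both the v3
# string-based add_pipe and the v2 create_pipe API.
nlp = spacy.blank("en")
try:
    nlp.add_pipe("sentencizer")
except (ValueError, TypeError):
    nlp.add_pipe(nlp.create_pipe("sentencizer"))

# Stand-in for a large stream of document texts.
texts = ("Doc number %d. It has two sentences." % i for i in range(1000))

# nlp.pipe processes texts in batches, which is much faster than
# calling nlp() on each text individually.
n_sents = 0
for doc in nlp.pipe(texts, batch_size=256):
    n_sents += sum(1 for _ in doc.sents)
```

Since the stream is a generator, nothing is held in memory beyond the current batch, which matters at the 5-million-document scale discussed here.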