Does terms.train need the pipeline if not merge_entities and not merge_nounchunks?

terms
spacy

(Madhu Jahagirdar) #1

I have been using Prodigy to train on 5 million document records, and at the current rate it will take 6-7 days, which is quite a long time. I was trying to optimize this a bit. When I disable all pipeline components except the sentencizer, performance is hugely better. My question is: if I am not using merge_entities or merge_nounchunks and my only use case right now is text classification, can I disable all pipeline components except the sentencizer, and does that impact the accuracy of the classifier?


(Matthew Honnibal) #2

If you’re not using merge_entities and you’re not using merge_nounchunks, then yes, you should disable the other pipeline components. We’ll look at doing this automatically.
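For reference, a minimal sketch of a sentencizer-only pipeline along these lines. It is written to run on both spaCy v2 and v3 (the thread predates v3's string-based add_pipe), and the example texts are illustrative:

```python
import spacy

# Start from a blank English pipeline: no tagger, parser or NER,
# so nothing beyond tokenization and sentence splitting runs.
nlp = spacy.blank("en")
try:
    nlp.add_pipe("sentencizer")                    # spaCy v3: factory name as string
except (ValueError, TypeError):
    nlp.add_pipe(nlp.create_pipe("sentencizer"))   # spaCy v2 fallback

doc = nlp("First sentence. Second sentence.")
sents = [sent.text for sent in doc.sents]
print(nlp.pipe_names)  # ['sentencizer']
print(sents)           # ['First sentence.', 'Second sentence.']
```

With only the rule-based sentencizer in the pipeline, no statistical models are loaded or run, which is where the large speed difference comes from.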


(Madhu Jahagirdar) #3

Should I disable the tagger (POS) as well?


(Matthew Honnibal) #4

Sure, yes.


(Madhu Jahagirdar) #5

Merged from thread Pipeline Sequence

@honnibal @ines

Is there any particular sequence in which I should add the pipeline components?
For a custom terms.train recipe I am adding them in the following sequence and getting an error:

 nlp = spacy.blank(lang)
 print("Using blank spaCy model ({})".format(lang))
 nlp.add_pipe(nlp.create_pipe('sentencizer'))
 nlp.add_pipe(nlp.create_pipe('tagger'))
 nlp.add_pipe(nlp.create_pipe('parser'))

Using blank spaCy model (en)
22:06:50 - 'pattern' package not found; tag filters are not available for English
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/prodigy/__main__.py", line 254, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 152, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/prodigy/recipes/terms.py", line 57, in train_vectors
    for doc in nlp.pipe((eg['text'] for eg in stream)):
  File "/home/madhujahagirdar/cnn-annotation/venv/lib/python3.5/site-packages/spacy/language.py", line 564, in pipe
    for doc in docs:
  File "pipeline.pyx", line 400, in pipe
  File "pipeline.pyx", line 405, in spacy.pipeline.Tagger.predict
AttributeError: 'bool' object has no attribute 'tok2vec'

Additionally, if I use only the sentencizer to create a word2vec model, and I don’t use merge_entities and merge_nounchunks in my textcat.batch-train, would that have any impact on classification performance? With all the pipeline components, 5 million documents take 5-7 days. Any help would be appreciated.


(Matthew Honnibal) #6

The terms.train-vectors recipe only trains word vectors. There’s no reason to add a blank tagger, parser or entity recognizer to the pipeline: they won’t be trained. The specific error you’re seeing above comes from not calling nlp.begin_training() to initialize the models; however, the real problem is that you shouldn’t be adding these components to the pipeline in this recipe.
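On the throughput question, streaming the texts through nlp.pipe with a sentencizer-only pipeline keeps per-document overhead minimal. A hedged sketch (the batch_size value and the generator of texts are illustrative, not part of the recipe):

```python
import spacy

# Sentencizer-only pipeline; the try/except covers both the v3
# string-based add_pipe and the v2 create_pipe API.
nlp = spacy.blank("en")
try:
    nlp.add_pipe("sentencizer")
except (ValueError, TypeError):
    nlp.add_pipe(nlp.create_pipe("sentencizer"))

# Stand-in for a large stream of document texts.
texts = ("Doc number %d. It has two sentences." % i for i in range(1000))

# nlp.pipe processes texts in batches, which is much faster than
# calling nlp() on each text individually.
n_sents = 0
for doc in nlp.pipe(texts, batch_size=256):
    n_sents += sum(1 for _ in doc.sents)
```

Since the stream is a generator, nothing is held in memory beyond the current batch, which matters at the 5-million-document scale discussed here.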