Relation between the Tagger, Parser, and NER pipeline components in spaCy

I have a question about the relationship among the pipeline components. In my understanding, a parser could benefit from the tagger's output, and likewise NER from parsing. However, in the current version, training these three components seems to be completely independent, e.g. you can train a parser before a tagger, and vice versa. I would really appreciate it if you could explain the relationship, and the techniques used in tagging and parsing, if possible.

I think all the components should at least share the word embeddings, but I suspect they share more. So does updating one component affect the others? If so, is it possible to avoid this? For example, I have trained the tagger and parser, and now I add some new annotations to the training data and want to train NER.

The last question relates to pretrain and how this functionality affects training the parser. In my understanding, pretrain uses pretrained word embedding vectors to train a contextual representation model (for spaCy, CNN filters? like the transformer layers in BERT and the BiLSTM layers in ELMo), using the unlabeled raw text from which the annotated/labeled training data is drawn. So how can the output of pretrain be used for training the tagger, parser, and NER components?

Looking forward to your help!


Hi! In spaCy v2.x, the pipeline components are separate, so your observation is correct: the entity recognizer doesn't use features like part-of-speech tags or anything else set by the other components. They only share the embeddings and word vectors, if available. So if you train your model with one set of word vectors and then remove or replace them, the model will likely perform very badly.
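To make the "shared embeddings, separate task weights" idea concrete, here's a toy numpy sketch (not spaCy's actual architecture, just the shape of the idea): two components read from one embedding table but have independent heads, so updating one head never touches the other, while swapping the shared vectors changes every component's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared embedding table (stand-in for spaCy's shared word vectors; illustrative only).
embeddings = rng.normal(size=(5, 4))   # 5 "words", 4-dim vectors

# Independent task heads: each component owns its weights.
tagger_w = rng.normal(size=(4, 3))     # 3 POS tags
ner_w = rng.normal(size=(4, 2))        # 2 entity labels

def predict(head_w, word_id):
    # Every component looks words up in the same shared table.
    return embeddings[word_id] @ head_w

# "Training" the NER head leaves the tagger head untouched.
before = tagger_w.copy()
ner_w -= 0.1 * rng.normal(size=ner_w.shape)  # a fake gradient step
assert np.array_equal(tagger_w, before)

# But swapping out the shared vectors changes *every* component's output,
# which is why replacing the vectors after training hurts so badly.
old_tag_scores = predict(tagger_w, word_id=2)
embeddings = rng.normal(size=(5, 4))         # "different word vectors"
new_tag_scores = predict(tagger_w, word_id=2)
assert not np.allclose(old_tag_scores, new_tag_scores)
```

This is also why you can safely train just the NER component on new annotations later: its gradient step only touches its own weights.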

(Btw, if you're interested in more details on the feature and haven't seen it yet, you might find this video on the NER model helpful.)

No, there's no direct interaction, and an update to one component will never update the weights of another.

However, it is possible that updates to one component can change the output of another. For example, by default, the parser will assign the sentence boundaries. The named entity recognizer is constrained by the sentence boundaries, so it'll never predict entities that cross a sentence boundary (which makes sense). So if you update the parser and it ends up predicting different sentence boundaries, you could theoretically end up with different entity predictions. But that's only because you redefined the constraints for the predictions – the NER weights themselves didn't actually change.
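The sentence-boundary constraint can be sketched in plain Python (hypothetical helper names; spaCy applies this constraint internally during decoding): the candidate entity spans and their scores stay fixed, but which ones survive depends on where the boundaries fall.

```python
# Candidate entity spans as (start, end, label); the NER weights that
# produced them never change in this example.
candidates = [(0, 2, "PERSON"), (3, 6, "ORG")]

def constrain(candidates, sent_starts):
    # Keep only spans that don't cross a sentence boundary.
    kept = []
    for start, end, label in candidates:
        if not any(start < b < end for b in sent_starts):
            kept.append((start, end, label))
    return kept

# With the parser predicting a boundary at token 4, the ORG span (3..6)
# crosses it and is dropped.
print(constrain(candidates, sent_starts=[4]))  # [(0, 2, 'PERSON')]

# Retrain the parser so the boundary moves to token 3: the same candidates,
# with the same scores, now all survive.
print(constrain(candidates, sent_starts=[3]))  # [(0, 2, 'PERSON'), (3, 6, 'ORG')]
```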

spaCy's pretrain command implements an idea we've called "language modelling with approximate outputs" (LMAO). To make it run fast and keep the model size small, the implementation uses a CNN to predict the vector of each word given its context. So we're not predicting the actual word, just its rough meaning, which is easier, and lets us leverage existing pre-trained word embeddings. For a quick overview and some results and examples, see my slides here:
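A minimal numpy sketch of the objective, under the assumption that the encoder is just a linear map over a mean-pooled context window (spaCy really uses a CNN; the point is only that the target is a static vector, not a word identity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained static vectors (e.g. GloVe-style) are the *targets*, not the inputs.
pretrained = rng.normal(size=(100, 16))   # 100 words, 16-dim vectors

# Toy "context encoder": a linear map over the mean of a +/-2 token window.
W = rng.normal(size=(16, 16)) * 0.1

def lmao_loss(word_ids):
    total = 0.0
    for i, wid in enumerate(word_ids):
        window = [word_ids[j]
                  for j in range(max(0, i - 2), min(len(word_ids), i + 3))
                  if j != i]
        context = pretrained[window].mean(axis=0) @ W
        # Predict the word's vector rather than its identity: an L2 loss
        # against a static vector is far cheaper than a softmax over the
        # whole vocabulary.
        total += ((context - pretrained[wid]) ** 2).sum()
    return total / len(word_ids)

print(lmao_loss([5, 17, 42, 8, 99]))
```

Training then minimises this loss over raw, unlabeled text, so the encoder learns contextual representations without any annotation.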

In spaCy 2.1+, you can use the pretrain command to create a tok2vec (token-to-vector) artifact that you can initialise a model with. Those pre-trained representations will then be shared by all components in the pipeline. The next update of Prodigy will introduce support for spaCy 2.1 and also for training with tok2vec artifacts. (See this thread for details and progress on the update. Since it's a breaking change that'll require all Prodigy users to retrain their models, we want to make sure to fix some outstanding spaCy bugs and test everything before we publish the update.)


Firstly, many thanks for the clear and kind explanation.

Regarding the mutual independence of the pipeline components, could you give some detail on the technical inspiration behind the current framework? I'm curious about the theoretical basis of the CNN-based parser and tagger in spaCy. Also, can I build multiple components of the same type in one pipeline, say, multiple NERs?

Thanks for your help again.

We considered several factors when designing the model architectures. The most important factors were speed and accuracy, but also how well the model might generalise to other text. I think CNNs are a bit less brittle to differences like sentential vs fragmentary inputs. If you train a BiLSTM on sentential inputs, and test it on inputs that are fragments like headlines or catalogue entries, the LSTM state can be different from the start and that can propagate over the whole sequence. A CNN isn’t like that: if you’re at a slice in the middle, it doesn’t matter that the beginning was different. So a CNN has some advantage in generalising to unexpected situations.

Thanks for your answers @honnibal and @ines. If I understand them correctly, spaCy now uses pretrain to create a tok2vec artifact, with which the tagger/parser/NER are all initialized in the same way (more specifically, the very bottom CNN layers).