Custom semantics dependency parser training for multi word phrases

SecML · November 13, 2018, 7:37am

I am looking to train a custom semantics dependency parser, but I am not sure if I can specify relations between phrases instead of single words. All the examples I have come across annotate dependencies at the word level. I am guessing that merging each phrase into a single token is not a good idea in my case: a phrase can contain terms that hit in the embeddings model and help with the ‘embed’ stage of model training using my existing word2vec model, whereas the phrase as a single token will not hit anything in that model.

In summary, what format should I use to specify relations between multi-word phrases when training a dependency parser? Or am I missing something? Thanks a lot!

PS: Some related info (not a question). My plan is to use the mark recipe in Prodigy, with my custom code driving the phrase identification, to get enough lines tagged to get over the cold-start problem, and then use the dep.teach recipe after that. At this point I think (and hope) that should work.

honnibal · November 13, 2018, 1:52pm

The dependency parser works between single tokens, not between whole phrases.
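Since the arcs have to run between single tokens, one way to keep phrase-level relations is to pick a representative token for each phrase and attach the relation to it. Here is a rough sketch of that idea in plain Python. The choice of the last token as the representative, the function name, and the `PART` and `AFFECTS` labels are all assumptions made for illustration, and the `heads`/`deps` lists only loosely follow the style of spaCy's custom-semantics training examples, so check the exact format your training code expects.

```python
def phrase_relations_to_token_arcs(tokens, phrases, relations):
    """Project phrase-level relations onto token-level heads and labels.

    tokens:    list of token strings
    phrases:   dict mapping a phrase name to a (start, end) token span,
               with `end` exclusive
    relations: list of (head_phrase, label, child_phrase) tuples

    Assumption: the last token of each phrase stands in for the phrase.
    """
    n = len(tokens)
    heads = list(range(n))  # every token starts out pointing at itself
    deps = ["-"] * n        # "-" marks tokens with no relation of interest

    # Inside a phrase, attach every non-final token to the final token.
    for start, end in phrases.values():
        anchor = end - 1
        for i in range(start, anchor):
            heads[i] = anchor
            deps[i] = "PART"  # hypothetical label for phrase-internal arcs

    # Between phrases, draw the arc between the two representative tokens.
    for head_name, label, child_name in relations:
        head_anchor = phrases[head_name][1] - 1
        child_anchor = phrases[child_name][1] - 1
        heads[child_anchor] = head_anchor
        deps[child_anchor] = label

    return heads, deps


tokens = ["remote", "code", "execution", "affects", "web", "server"]
phrases = {"vuln": (0, 3), "target": (4, 6)}
relations = [("vuln", "AFFECTS", "target")]
heads, deps = phrase_relations_to_token_arcs(tokens, phrases, relations)
```

The tokens stay separate, so the existing word vectors still apply to each word, and the phrase can be recovered afterwards by following the `PART` arcs.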

You're definitely right that merging the phrases will cause the pre-trained vectors to "miss", and that can be bad. A few possible solutions:

1. Train new vectors, with phrases instead of words. The terms.train-vectors recipe has an example of this, it lets you merge noun phrases or named entities. You can merge other phrases you want as well, so long as you can identify them in some post-process.
2. Alternatively, you could try just not using pre-trained vectors. This is usually much worse when the dataset is small, but it could still be useful as a quick experiment.
3. You could also add the merged phrases you create to the word vectors, probably using the vector of the last word in the phrase. See here: Vectors · spaCy API Documentation
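The last option can be sketched roughly as follows. This uses a plain dict as a stand-in for the vector table so that it runs without any model; the function name and the example words are made up for illustration. In spaCy itself, `Vocab.set_vector` is one call that writes an entry like this, and the key would need to match the text of the merged token.

```python
def add_phrase_vectors(vectors, phrases, joiner=" "):
    """Give each merged phrase the vector of its last word, when known.

    vectors: dict mapping a string to a list of floats
             (a stand-in for a real word-vector table)
    phrases: iterable of token lists, one list per phrase
    joiner:  string used to build the merged key (assumed to match
             how the merged token's text is produced)

    Returns the list of keys that were added.
    """
    added = []
    for words in phrases:
        key = joiner.join(words)
        last_word = words[-1]
        # Skip phrases that already have a vector, or whose last word
        # is itself missing from the table.
        if key in vectors or last_word not in vectors:
            continue
        vectors[key] = list(vectors[last_word])  # copy, do not alias
        added.append(key)
    return added


vectors = {"execution": [0.1, 0.2], "server": [0.3, 0.4]}
phrases = [
    ["remote", "code", "execution"],
    ["web", "server"],
    ["zero", "day"],  # "day" has no vector, so this phrase is skipped
]
added = add_phrase_vectors(vectors, phrases)
```

Phrases whose last word has no vector are left out, so they would still "miss" and need one of the other options.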

I would probably try 2 or 3 as a quick experiment, so that you can keep working on the rest of your task, without stopping to work on the word vectors. Training a custom dependency parser is one of the harder annotation tasks. I think your plan sounds good, but it's likely to be a much less smooth process than training a new entity type.

You might find that the easiest way to get over the cold-start problem is to just annotate some documents from scratch. We haven't implemented a front-end for that in Prodigy, because we don't really have a better solution than the free tools, which can be found here: UD tools

SecML · November 19, 2018, 7:27pm

Thank you so much for the detailed response. I am trying this out.
