I am looking to train a custom semantic dependency parser, but I am not sure whether I can specify relations between phrases instead of single words. All the examples I have come across annotate dependencies at the word level. I am guessing that merging each phrase into a single token is not a good idea in my case: the phrase can contain terms that are in my existing word2vec model and would help with the ‘embed’ stage of model training, whereas the phrase as a single token will not match anything in that model.
In summary, what format should I use to specify relations between multi-word phrases when training a dependency parser? Or am I missing something? Thanks a lot!
PS: Some related info (not a question). My plan is to use the mark recipe in Prodigy, with my custom code driving the phrase identification, to get enough lines tagged to get over the cold-start problem, and then use the dep.teach recipe after that. At this point I think (and hope) that should work.
The dependency parser works between single tokens, not between whole phrases.
You’re definitely right that merging the phrases will cause the pre-trained vectors to “miss”, and that can be bad. A few possible solutions:
1. Train new vectors, with phrases instead of words. The terms.train-vectors recipe has an example of this: it lets you merge noun phrases or named entities. You can merge other phrases you want as well, so long as you can identify them in some post-process.
2. Alternatively, you could try just not using pre-trained vectors. This is usually much worse when the dataset is small, but it could still be useful as a quick experiment.
3. You could also add the merged phrases you create to the word vectors, probably using the vector of the last word in the phrase. See here: https://spacy.io/api/vectors#add
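To make the last option concrete, here is a minimal sketch in plain Python. It is only an illustration under some assumptions: a dict stands in for the vector table, merged tokens are joined with an underscore, and the helper names (merge_phrases, add_phrase_vectors) are made up for this example. With spaCy you would store the vectors through the API linked above instead of a dict.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def merge_phrases(tokens: Sequence[str], spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Merge each (start, end) token span (end exclusive) into one "_"-joined token."""
    ends = {start: end for start, end in spans}
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        end = ends.get(i, i)
        if end > i:
            # This position starts a phrase: emit it as a single token.
            merged.append("_".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def add_phrase_vectors(table: Dict[str, Vector], merged_tokens: Sequence[str]) -> List[str]:
    """Give each unseen merged token a copy of its last word's vector, if known."""
    added: List[str] = []
    for token in merged_tokens:
        if token in table or "_" not in token:
            continue
        last_word = token.rsplit("_", 1)[-1]
        if last_word in table:
            table[token] = list(table[last_word])
            added.append(token)
    return added


# Hypothetical toy data: a 2-dimensional "vector table" and one phrase span.
table = {"neural": [0.7, 0.2], "network": [0.1, 0.9], "trains": [0.5, 0.5]}
tokens = ["the", "neural", "network", "trains"]
merged = merge_phrases(tokens, [(1, 3)])
added = add_phrase_vectors(table, merged)
```

After this runs, the merged token "neural_network" has the same vector as "network", so it no longer misses in the table. Whether the last word is the right choice depends on your phrases; it is a reasonable default when the head of the phrase tends to come last.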
I would probably try 2 or 3 as a quick experiment, so that you can keep working on the rest of your task, without stopping to work on the word vectors. Training a custom dependency parser is one of the harder annotation tasks. I think your plan sounds good, but it’s likely to be a much less smooth process than training a new entity type.
You might find that the easiest way to get over the cold-start problem is to just annotate some documents from scratch. We haven’t implemented a front-end for that in Prodigy, because we don’t really have a better solution than the free tools, which can be found here: http://universaldependencies.org/tools.html#third-party-tools
Thank you so much for the detailed response. I am trying this out.