Training a dependency parser

Hi everyone,

I'm trying to use spaCy and Prodigy to read resumes, by identifying their entities and the relationships between those entities as a dependency tree, using custom semantics of course. I'm using spaCy's named entity recognizer bootstrapped with around 400 annotated resumes that also have the relationships annotated. Due to the nature of my domain, I should only have one tree per resume, making the entire resume the "sentence". To do this, my model's pipeline includes the named entity recognizer, a custom component for merging the entities into single tokens, a custom component for making the whole document a single sentence, and finally the parser.
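
To make this concrete, the setup looks roughly like this (a simplified sketch in spaCy v2 style; the retokenizer needs v2.1+, and the component names are mine):

```python
import spacy

def merge_entities(doc):
    # Merge each predicted entity into a single token, so multi-word
    # entities behave as one unit for the parser.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    return doc

def single_sentence(doc):
    # Force the whole resume to be one sentence, so we get one tree.
    for token in doc[1:]:
        token.is_sent_start = False
    return doc

nlp = spacy.blank("en")
nlp.add_pipe(nlp.create_pipe("ner"))
nlp.add_pipe(merge_entities, name="merge_entities", after="ner")
nlp.add_pipe(single_sentence, name="single_sentence")
nlp.add_pipe(nlp.create_pipe("parser"))
```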

I'm having trouble setting up this environment so that I can keep training it with Prodigy.

I’ve tried two approaches.

In the first, I tried having it all as part of a single Language instance and using that model for both the ner and the dependency recipes, but the Python kernel keeps dying and restarting when it's supposed to parse. The details of how I attempted this can be found here.

In my second approach I tried having two different Language instances, one for the NER and one for the parser, and training each model with the appropriate recipe. I then used a function that processes the text with the NER model, merges the entities, and leaves the doc as a single sentence as the make_doc of the parser's Language instance. The error I'm now getting with this approach is "Could not find a gold-standard action to supervise the dependency parser. The GoldParse was projective. The transition system has 207 actions.". The details of this implementation can be found here.
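
Roughly, the glue between the two models looks like this (a simplified sketch; the paths are placeholders):

```python
import spacy

ner_nlp = spacy.load("/path/to/ner_model")        # placeholder paths
parser_nlp = spacy.load("/path/to/parser_model")

def make_doc(text):
    # Run the NER pipeline, merge entities into single tokens, and
    # mark the whole resume as one sentence before parsing.
    doc = ner_nlp(text)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    for token in doc[1:]:
        token.is_sent_start = False
    return doc

# Replace the parser pipeline's make_doc so every text it receives is
# preprocessed by the NER model first.
parser_nlp.make_doc = make_doc
```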

My question is: am I approaching the problem wrong, or are spaCy's models simply not appropriate for the problem I'm working on? And if they're not, would it be possible for me to create a custom dependency parsing model and train it with Prodigy?

Thank you

tl;dr: Can you give a sample of the text with the heads and dependencies you've assigned as the gold standard?

From a theoretical perspective, your approach does make sense. However, the implementation of the parser is designed around syntactic relationships, which are normally between words fairly close together. So the features in the parser might not work well for your task, and it might be unexpectedly slow.

The other thing about the parsing algorithm is that it's fairly complicated. If you haven't seen it yet, I would suggest you have a look at this blog post: Parsing English in 500 Lines of Python · Explosion . The post is old, but the parser still uses the same transition-based approach supervised by a "dynamic oracle". It's just that we now use a neural network to optimise, instead of the simpler ML algorithm described there.

These two sections are especially relevant:

The gist of this is that we're setting up an initial state that has a stack, a queue of words, and a set of dependency arcs. We then define some fixed set of actions that we'll use to transition from one state to another. This lets us map the parsing task to the task of predicting a sequence of actions that ends with a desirable parse tree.

The important thing to understand for your error is that the training algorithm requires us to assign a "cost" to each potential action, where the cost is the number of additional errors that action would introduce. In other words: What's the score of the best parse we can make from this state? Okay, what's the score of the best parse we can make if we apply this action to this state? The cost of the action is the difference between the two.
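
If it helps, here's a toy sketch of the idea in Python (nothing like the real Cython implementation; n_reachable is a hypothetical stand-in for the oracle's bookkeeping and isn't defined here):

```python
from collections import namedtuple

# A state: a stack, a buffer of remaining words, and the arcs built
# so far. Actions like SHIFT and RIGHT-ARC move between states.
State = namedtuple("State", ["stack", "buffer", "arcs"])

def shift(state):
    # Push the next word from the buffer onto the stack.
    return State(state.stack + [state.buffer[0]], state.buffer[1:], state.arcs)

def right_arc(state, label):
    # Attach the top of the stack as a child of the word beneath it.
    head, child = state.stack[-2], state.stack[-1]
    return State(state.stack[:-1], state.buffer,
                 state.arcs + [(head, child, label)])

def action_cost(state, apply_action, gold_arcs):
    # Cost = gold arcs reachable now, minus gold arcs still reachable
    # after taking the action. Zero-cost actions are the ones the
    # oracle lets us learn from. (n_reachable is hypothetical.)
    return n_reachable(state, gold_arcs) - n_reachable(apply_action(state), gold_arcs)
```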

So, your error is saying that given the state you're in and the gold-parse you've assigned to the sentence, none of the actions result in zero cost. This might occur if there's no way to derive the gold-standard you assigned given the actions the parser has (for instance, if your gold-standard uses a label the parser doesn't know about).
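
One easy thing to rule out is a missing label: in spaCy v2 you can register every label from your gold-standard on the parser before training. The label names and path below are placeholders:

```python
import spacy

nlp = spacy.load("/path/to/parser_model")  # placeholder path
parser = nlp.get_pipe("parser")
for label in ("SKILL", "DEGREE", "COMPANY"):  # whatever your scheme uses
    parser.add_label(label)
```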

Another thing to keep in mind is that the parser has to build a tree that covers every token. It's possible to underspecify the tree, in which case the parser will have no guidance for some of the arcs. In your case, if you're trying to only extract relations between some entities in a whole document, you might end up with the vast majority of tokens underspecified. This will probably be very hard for the parser to learn from.
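
To make "underspecified" concrete: in spaCy v2's GoldParse, a None head or dep means "no supervision for this token". A sketch, with an invented sentence and labels:

```python
import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
doc = nlp.make_doc("Jane knows Python and SQL")
gold = GoldParse(
    doc,
    heads=[None, 1, 1, None, 1],                  # head indices; None = unspecified
    deps=[None, "ROOT", "SKILL", None, "SKILL"],  # labels; None = unspecified
)
```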

Thank you so much for your response.

It's clear to me why my approach wasn't working. I'm wondering if I could modify the parser so that it only attempts to predict relationships between entities and already knows which tokens have the "-" dependency, or if I could internally reduce the document to just the entities. I also know that most entity/dependency combinations can never happen, and I know some dependencies in advance (for example, my entity SKILL can only have a direct dependency on the ROOT, with the dependency label "SKILL"). My guess is that if I can somehow modify your code to take those rules into consideration, I could get it to work. Do you think that's possible, or would I be better off designing my own parser that better fits my needs?

I was going to suggest making a new Doc object with just the entities, and perhaps some contextual tokens that you think are particularly important. I think that's a great plan. You just need to track an alignment into the original document -- something like alignment = {i: span.start_char for i, span in enumerate(doc.ents)}. This way you can recover the character offset in the original doc once you have the dependencies.
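
A minimal sketch of that idea, assuming spaCy v2:

```python
from spacy.tokens import Doc

def make_entity_doc(doc):
    # One token per entity, plus an alignment from the new token
    # indices back to character offsets in the original document.
    words = [ent.text for ent in doc.ents]
    alignment = {i: ent.start_char for i, ent in enumerate(doc.ents)}
    return Doc(doc.vocab, words=words), alignment
```

Once you've parsed the reduced doc, alignment[token.i] gets you back to the character offset of that entity in the original text.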

In theory, yes, you can change the is_valid functions. These are stored by function pointer within the parser.moves.c array, which is of type Transition*. It's fiddly, though: it's easy to make a mistake and end up with constraints that prevent the parser from finding any valid actions.

Another approach is to use your constraints to evaluate the predicted dependencies, or to propose new ones for self-training. This would encourage the statistical model to "internalize" your constraints in the weights.
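
For instance, with the SKILL rule from your post, a filter like the following could screen predicted parses before you reuse them as extra training examples (just a sketch; parsed_docs stands in for whatever batch of predictions you have, and it assumes your root carries the "ROOT" label):

```python
def satisfies_constraints(doc):
    # Rule from above: a SKILL entity may only attach directly to the
    # root, with the dependency label "SKILL".
    for token in doc:
        if token.ent_type_ == "SKILL":
            if token.dep_ != "SKILL" or token.head.dep_ != "ROOT":
                return False
    return True

# Keep only parses that respect the constraints for self-training.
silver = [doc for doc in parsed_docs if satisfies_constraints(doc)]
```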

Version 1.4 of Prodigy will have experimental support for dependency annotation, using a dep.teach recipe. I hope you'll try it out once it's available and let us know how it goes. I'd like to add an API for adding parser constraints in spaCy as well. I think it'll really help make the parser more generally applicable to problems such as yours.

For the annotation part, see my comment on this thread:

Just released v1.4.0, which comes with a dependency annotation interface and (still experimental) dep.teach, dep.batch-train and dep.train-curve recipes! :tada: See here for a demo of the new interface.
