Yay, coming up with creative ideas for annotation problems is one of my favourite things on the forum
You’re right that we haven’t really solved the “manual dependency annotation” problem yet. Labelling all dependencies and presenting them in a visually useful way turned out to be genuinely difficult. But here are some alternative ideas, focusing on the best efficiency vs. data quality trade-offs:
ner.manual with “head” and “child” labels
If your label scheme is small (like the one in the example), you could get away with using ner.manual with two labels per dependency: for example, PLACE_HEAD and PLACE_CHILD. You could also use the "labels" setting in your custom Prodigy theme to give them different colours, so it’s easier to keep an overview as you’re annotating.
Once you have the data, you’ll need to convert it to the correct format for training the parser, but that should be relatively easy.
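The conversion step could look something like this – a minimal sketch that pairs up `*_HEAD` and `*_CHILD` spans from an ner.manual example into dependency triples. The example structure follows Prodigy’s spans format, but the label naming scheme and the one-head-one-child-per-label assumption are mine, not an official converter:

```python
# Sketch: pair *_HEAD / *_CHILD spans from an ner.manual example into
# (head_offset, child_offset, label) dependency triples.
# Assumes at most one head and one child per label per example.

def spans_to_deps(example):
    heads, children = {}, {}
    for span in example["spans"]:
        label, role = span["label"].rsplit("_", 1)
        if role == "HEAD":
            heads[label] = span
        elif role == "CHILD":
            children[label] = span
    deps = []
    for label, head in heads.items():
        child = children.get(label)
        if child is not None:
            deps.append((head["start"], child["start"], label))
    return deps

example = {
    "text": "The pizza was amazing",
    "spans": [
        {"start": 4, "end": 9, "label": "QUALITY_HEAD"},
        {"start": 14, "end": 21, "label": "QUALITY_CHILD"},
    ],
}
print(spans_to_deps(example))  # [(4, 14, 'QUALITY')]
```

From those triples you can then build the head/dep annotations spaCy expects for parser training.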
Binary annotations assisted by POS tags
Giving binary feedback on one dependency at a time using the dep interface might still be more efficient, because it makes it easier to see what’s going on and focus on one relationship at a time.
Even without knowing anything about the data, you could do this by creating one example per possible combination of tokens per label – but this would obviously be way too much and create all kinds of nonsense examples you’d have to reject constantly. To narrow in on only plausible candidates, you could use the part-of-speech tags: for example, you could only create candidate examples for combinations of nouns and adjectives (for a label like QUALITY). This might even turn up some interesting ambiguities or edge cases in your data that you wouldn’t have noticed otherwise.
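Here’s a rough sketch of that candidate generation, working from already-POS-tagged tokens so it’s self-contained (in practice you’d get the tags from spaCy). The task layout with "tokens" and "arcs" follows the dep interface format as I understand it, and the QUALITY label and noun–adjective filter are just illustrative assumptions:

```python
# Sketch: create one candidate dep task per noun-adjective pair.
# tokens: list of (text, pos) tuples, e.g. produced by a spaCy pipeline.

def make_candidates(tokens, label="QUALITY"):
    nouns = [i for i, (_, pos) in enumerate(tokens) if pos == "NOUN"]
    adjs = [i for i, (_, pos) in enumerate(tokens) if pos == "ADJ"]
    text = " ".join(text for text, _ in tokens)
    # Compute character offsets, assuming single-space joining
    offsets, start = [], 0
    for text_, _ in tokens:
        offsets.append((start, start + len(text_)))
        start += len(text_) + 1
    token_dicts = [
        {"text": t, "start": s, "end": e, "id": i}
        for i, ((t, _), (s, e)) in enumerate(zip(tokens, offsets))
    ]
    tasks = []
    for n in nouns:
        for a in adjs:
            tasks.append({
                "text": text,
                "tokens": token_dicts,
                "arcs": [{"head": n, "child": a, "label": label}],
            })
    return tasks

tokens = [("The", "DET"), ("pizza", "NOUN"), ("was", "AUX"), ("amazing", "ADJ")]
tasks = make_candidates(tokens)
print(len(tasks))  # 1 candidate: pizza <- amazing
```

Each task presents exactly one arc, so the annotator only ever has to accept or reject a single relationship.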
Based on that data, you could then create examples in Prodigy’s dep format (see the README for examples), and then annotate that data by streaming it into the mark recipe with --view-id dep.
If the POS tags aren’t very good on your data (e.g. if you’re annotating dialogue texts or something like that), you can always use pos.teach to improve the model until the predictions are well tuned to your domain. This will be super useful anyway, especially if your data is non-standard.
I’d say the biggest obstacle here is getting over the “cold start” – after that, you can use dep.teach to improve your model without having to label too much by hand.