Yay, coming up with creative ideas for annotation problems is one of my favourite things to do on the forum!
You’re right that we haven’t really solved the “manual dependency annotation” problem yet. Labelling all dependencies and presenting them in a way that is visually “useful” turned out to be super difficult, actually. But here are some alternative ideas, focusing on the best efficiency vs. data quality trade-offs:
**Use `ner.manual` with “head” and “child” labels**
If your label scheme is small (like the one in the example), you could get away with using `ner.manual` with two labels per dependency: `PLACE_HEAD` and `PLACE_CHILD`. You could also use the `"labels"` setting in your custom Prodigy theme to give them different colours, so it’s easier to keep an overview as you’re annotating.
Once you have the data, you’ll need to convert it to the correct format for training the parser, but that should be relatively easy.
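As a minimal sketch of that conversion step (the function name and the assumption of at most one head/child pair per label and example are mine, not part of any Prodigy API), you could pair up the `*_HEAD` and `*_CHILD` spans into dependency triples:

```python
# Hypothetical sketch: pair up *_HEAD and *_CHILD spans from ner.manual
# annotations into (head_start, child_start, label) dependency triples.
# Assumes each example contains at most one head/child pair per label.

def spans_to_deps(example):
    """Convert an example with spans like PLACE_HEAD / PLACE_CHILD
    into a list of (head_start, child_start, label) triples."""
    heads, children = {}, {}
    for span in example.get("spans", []):
        # "PLACE_HEAD" -> base label "PLACE", role "HEAD"
        label, _, role = span["label"].rpartition("_")
        if role == "HEAD":
            heads[label] = span
        elif role == "CHILD":
            children[label] = span
    deps = []
    for label, head in heads.items():
        child = children.get(label)
        if child is not None:
            deps.append((head["start"], child["start"], label))
    return deps

example = {
    "text": "a small house in New York",
    "spans": [
        {"start": 8, "end": 13, "label": "PLACE_HEAD"},
        {"start": 17, "end": 25, "label": "PLACE_CHILD"},
    ],
}
print(spans_to_deps(example))  # [(8, 17, 'PLACE')]
```

From those triples, you can resolve the character offsets back to token indices and build the head/label arrays the parser expects.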
**Binary annotations assisted by POS tags**
Giving binary feedback on one dependency at a time using the `dep` interface might still be more efficient, because it makes it easier to see what’s going on and focus on one relationship at a time.
Even without knowing anything about the data, you could do this by creating one example per possible combination of tokens per label – but this would obviously be way too much and create all kinds of nonsense examples you’d have to reject constantly. To narrow in on only possible candidates, you could use the part-of-speech tags: for example, you could only create candidate examples for combinations of `VERB` → `NOUN` (label `PLACE`), or `NOUN` → `ADJ` (label `QUALITY`). This might even turn up some interesting ambiguities or edge cases in your data that you wouldn’t have noticed otherwise.
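To sketch that filtering step, assuming you already have POS tags for each token (e.g. from spaCy), something like this could generate only the plausible candidate pairs – the rule table just mirrors the example labels above:

```python
# Sketch: generate candidate (head, child, label) pairs from POS-tagged
# tokens, so annotators only ever see plausible combinations.
# The POS-pair -> label mapping mirrors the example scheme above.

CANDIDATE_RULES = {
    ("VERB", "NOUN"): "PLACE",
    ("NOUN", "ADJ"): "QUALITY",
}

def candidate_pairs(tagged_tokens):
    """tagged_tokens: list of (text, pos) tuples.
    Yields (head_index, child_index, label) for each rule match."""
    for i, (_, head_pos) in enumerate(tagged_tokens):
        for j, (_, child_pos) in enumerate(tagged_tokens):
            if i == j:
                continue
            label = CANDIDATE_RULES.get((head_pos, child_pos))
            if label:
                yield (i, j, label)

tokens = [("find", "VERB"), ("a", "DET"), ("cheap", "ADJ"), ("hotel", "NOUN")]
print(list(candidate_pairs(tokens)))
# [(0, 3, 'PLACE'), (3, 2, 'QUALITY')]
```

In practice you’d probably also want to limit the distance between the two tokens, so you don’t generate pairs across the whole sentence.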
Based on that data, you could then create annotations in Prodigy’s `dep` format (see the README for examples), and then annotate that data by streaming it into the `mark` recipe with `--view-id dep`.
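A task for the `dep` view might be built like this – note that the exact schema (the `"tokens"` and `"arcs"` keys here) is my assumption based on the interface, so double-check it against the README examples:

```python
# Hypothetical sketch of building one dep-style task per candidate pair.
# The "tokens"/"arcs" keys are an assumption about the task schema -
# check the real format in the Prodigy README before using this.

def make_dep_task(words, head, child, label):
    """Create one binary annotation task for the dep view.
    words: list of token strings; head/child: token indices."""
    tokens, offset = [], 0
    for i, word in enumerate(words):
        tokens.append({"text": word, "id": i,
                       "start": offset, "end": offset + len(word)})
        offset += len(word) + 1  # assumes single-space tokenization
    return {
        "text": " ".join(words),
        "tokens": tokens,
        "arcs": [{"head": head, "child": child, "label": label}],
    }

task = make_dep_task(["find", "a", "cheap", "hotel"], 0, 3, "PLACE")
print(task["text"])  # find a cheap hotel
print(task["arcs"])  # [{'head': 0, 'child': 3, 'label': 'PLACE'}]
```

You could then write one task per candidate to a JSONL file and stream it in with something like `prodigy mark place_deps candidates.jsonl --view-id dep` (dataset and file names made up, obviously).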
If the POS tags aren’t very good on your data (e.g. if you’re annotating dialogue texts or something like that), you can always use `pos.teach` to improve the model until the predictions are well tuned to your domain. This will be super useful anyway, especially if your data is non-standard.
I’d say the biggest obstacle here is getting over the “cold start” – after that, you can use `dep.teach` to improve your model without having to label too much by hand.