Training a parser for custom semantics

I work on relation extraction and found this topic very helpful. I am wondering whether it is possible to leverage the power of Prodigy to train a new parser model with custom dependencies?

It doesn’t currently seem possible to use something conceptually similar to:

prodigy dep.manual [dataset] en_core_web_sm tweets.jsonl --dependencies ROOT,PLACE,QUALITY,ATTRIBUTE,TIME,LOCATION

Once trained, the new parser should work like this:

texts = ["find a hotel with good wifi"]
docs = nlp.pipe(texts)
for doc in docs:
    print(doc.text)
    print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != '-'])


# Expected output:
# find a hotel with good wifi
# [
#   ('find', 'ROOT', 'find'),
#   ('hotel', 'PLACE', 'find'),
#   ('good', 'QUALITY', 'wifi'),
#   ('wifi', 'ATTRIBUTE', 'hotel')
# ]

Any creative ideas?

Thanks.

Yay, coming up with creative ideas for annotation problems is one of my favourite things on the forum :smile:

You’re right that we haven’t really solved the “manual dependency annotation” problem yet. Labelling all dependencies and presenting them in a way that is visually “useful” turned out to be super difficult, actually. But here are some alternative ideas, focusing on the best efficiency vs. data quality trade-offs:

Use ner.manual with “head” and “child” labels

If your label scheme is small (like the one in the example), you could get away with using ner.manual with two labels per dependency: PLACE_HEAD and PLACE_CHILD. You could also use the "labels" setting in your custom Prodigy theme to give them different colours, so it’s easier to keep an overview as you’re annotating.
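
For example (reusing the model and source file from your post; the dataset name and exact label set here are just made up), the command could look something like this:

prodigy ner.manual hotel_deps_manual en_core_web_sm tweets.jsonl --label PLACE_HEAD,PLACE_CHILD,QUALITY_HEAD,QUALITY_CHILD,ATTRIBUTE_HEAD,ATTRIBUTE_CHILD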

Once you have the data, you’ll need to convert it to the correct format for training the parser, but that should be relatively easy.
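
Just to illustrate the idea, here’s a minimal sketch of that conversion (untested; it assumes one head/child pair per relation label in each example, and that the spans include Prodigy’s "token_start" index):

def spans_to_parse(eg):
    # eg: one annotated example with "text", "tokens" and "spans"
    n_tokens = len(eg["tokens"])
    heads = list(range(n_tokens))   # default: every token is its own head
    deps = ["-"] * n_tokens         # "-" = no custom relation
    pairs = {}
    for span in eg.get("spans", []):
        # e.g. "PLACE_HEAD" -> ("PLACE", "HEAD")
        base, role = span["label"].rsplit("_", 1)
        pairs.setdefault(base, {})[role] = span["token_start"]
    for base, roles in pairs.items():
        if "HEAD" in roles and "CHILD" in roles:
            heads[roles["CHILD"]] = roles["HEAD"]
            deps[roles["CHILD"]] = base
    return eg["text"], {"heads": heads, "deps": deps}

The (text, {"heads": ..., "deps": ...}) tuples this produces are roughly the format spaCy’s example code for training a parser with custom semantics expects.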

Binary annotations assisted by POS tags

Giving binary feedback on one dependency at a time using the dep interface might still be more efficient, because it makes it easier to see what’s going on and focus on one relationship at a time.

Even without knowing anything about the data, you could do this by creating one example per possible combination of tokens per label – but this would obviously be way too much and create all kinds of nonsense examples you’d have to reject constantly. To narrow in on only possible candidates, you could use the part-of-speech tags: for example, you could only create candidate examples for combinations of VERB → NOUN (label PLACE), or NOUN → ADJ (label QUALITY). This might even turn up some interesting ambiguities or edge cases in your data that you wouldn’t have noticed otherwise.
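
Here’s a rough sketch of how generating those candidates could look (untested; the rules are just examples, and the "arcs" field is an assumption you should double-check against the dep format in the README):

import spacy

nlp = spacy.load("en_core_web_sm")

# (head POS, child POS) -> candidate dependency label
CANDIDATE_RULES = {("VERB", "NOUN"): "PLACE", ("NOUN", "ADJ"): "QUALITY"}

def candidate_tasks(texts):
    for doc in nlp.pipe(texts):
        tokens = [{"text": t.text, "start": t.idx, "end": t.idx + len(t.text), "id": t.i}
                  for t in doc]
        for head in doc:
            for child in doc:
                label = CANDIDATE_RULES.get((head.pos_, child.pos_))
                if label is None or head.i == child.i:
                    continue
                yield {
                    "text": doc.text,
                    "tokens": tokens,
                    "arcs": [{"head": head.i, "child": child.i, "label": label}],
                    "label": label,
                }

Each task contains exactly one arc, so you can accept or reject a single candidate relation at a time.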

Based on that data, you could then create tasks in Prodigy’s dep format (see the README for examples) and annotate them by streaming them into the mark recipe with --view-id dep.
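
For example, if you dump the candidate tasks to a JSONL file, the annotation step could look something like this (dataset name made up):

prodigy mark hotel_deps_binary candidates.jsonl --view-id dep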

If the POS tags aren’t very good on your data (e.g. if you’re annotating dialog texts or something like that), you can always use pos.teach to improve the model until you have well-tuned predictions on your domain. This will be super useful anyway, especially if your data is non-standard.
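
For example, something like this could work to focus on the tags you actually care about (dataset name is made up again):

prodigy pos.teach hotel_pos en_core_web_sm tweets.jsonl --label VERB,NOUN,ADJ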

I’d say the biggest obstacle here is getting over the “cold start”. After that, you can use dep.teach to improve your model without having to label too much by hand.
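
For example, once you have an initial model trained from the converted annotations (the model path here is just a placeholder):

prodigy dep.teach hotel_deps_teach /path/to/custom-parser-model tweets.jsonl --label PLACE,QUALITY,ATTRIBUTE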


Hi @Andrey

I am just about to start the same journey as you. Just wondering if you have any experience you’d like to share regarding training a parser for custom semantics?

Thanks