Ideas on how to make rel_component work for long German cooking texts?

Hi!

We are trying to make the experimental rel_component (projects/tutorials/rel_component at v3 · explosion/projects on GitHub) work on our annotated data set of German cooking instructions (German Recipes Dataset on Kaggle).

Our performance is very poor (max score = 9%). We don't fully understand how the rel_component works internally with Thinc, or what effect the various settings in the config files have. Having sunk countless hours into Google and fruitless experiments, we were wondering if you could give us some general hints on what to change.

Compared to the example used in your spacy-nightly tutorial, the main differences in terms of input data are:

  • Our texts are much longer

  • We have 3 types of relations instead of 2

  • We have 8 types of entities

  • We have many more relations per doc

  • Our relations might be a bit more heterogeneous.

We are not sure whether we should, for instance, tweak settings in the config, use pretrained vectors, or try to change the model architecture / make the model bigger.

I know this is quite a broad question, but we would be grateful for any help. If more information is needed, I would be happy to provide it :smiley:

Hi Cornelius,

It's always difficult to give general advice on specific use-cases, but I'll try to brainstorm a little out loud.

As far as my German takes me, I wonder a little about your annotation scheme, looking at the screenshot. I understand "ARG0" points to ingredients, usually connected to a verb. "ARG" points to additional information on how to carry out the instruction, like "fine" (for "chopping"), but also seems to be used to label adjectives like "big" (for "carrots"). And then "ARG1" seems to point to "tools" of some sort - a bowl, etc.

I can understand trying to link ingredients to verbs, and having modifying information attached to those verbs (like "in olive oil"). But the annotation goes much further and highlights prepositions as single "entities" (e.g. "in") or words like "dann" ("then"). The granularity at which you've annotated these almost starts looking like a dependency parse, with part-of-speech information annotated for prepositions and adjectives.

In fact, I'm starting to wonder whether it wouldn't be more beneficial for you to train a tagger & parser on this type of data, and then use that information to deduce the relations you're looking for. For instance, if you can identify a clear cooking verb like "chopping", the nouns that are the objects connected to that verb are probably your ingredients. You could also try running a pretrained parser, but I would guess that you'd need at least some kind of fine-tuning on this specific data, as cooking instructions are generally a little different from sentences from news articles and the like.
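
To make that concrete, here's a minimal sketch of what such a rule-based extraction step could look like once you have a parser that behaves well on recipe text. Everything in here is an assumption to adapt: the model name, the list of cooking verbs, and the TIGER-style dependency labels ("oa" for accusative object, "mo" for modifier) that spaCy's German models use:

```python
import spacy

# Sketch only: assumes the pretrained German parser copes with
# recipe-style infinitive constructions, which you'd want to verify.
nlp = spacy.load("de_core_news_sm")

COOKING_VERBS = {"hacken", "schneiden", "braten", "anbraten"}  # illustrative

doc = nlp("Die Zwiebeln fein hacken und in Olivenöl anbraten.")
for token in doc:
    if token.pos_ == "VERB" and token.lemma_ in COOKING_VERBS:
        for child in token.children:
            if child.dep_ == "oa":    # accusative object -> likely an ingredient
                print("ingredient:", child.text, "->", token.text)
            elif child.dep_ == "mo":  # modifier -> manner / location information
                print("modifier:", child.text, "->", token.text)
```

If the pretrained labels don't hold up on your texts, that's exactly where the fine-tuning mentioned above comes in.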

If you do still want to go the "REL route", I think you'd definitely benefit from incorporating part-of-speech tags / dependency parsing information into your classification model. I realise that's not entirely straightforward to implement, and we don't have a current example. But in most relation extraction challenges, you want the classifier to pick up on the grammar in the sentence, and word embeddings by themselves might not be sufficient.
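
As a very rough illustration of what "incorporating grammar" can mean at the input level - this is not how the tutorial's model is wired, just a sketch with a placeholder model name and tag subset - you could concatenate one-hot POS features onto the word vectors before they reach the classifier:

```python
import numpy
import spacy

# Assumes a German pipeline with static vectors and a tagger,
# e.g. spaCy's medium model "de_core_news_md".
nlp = spacy.load("de_core_news_md")
doc = nlp("Die Zwiebeln fein hacken.")

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADP", "ADV"]  # illustrative subset

def token_features(token):
    """Word vector + one-hot coarse POS tag, so the classifier sees grammar."""
    pos_onehot = numpy.zeros(len(POS_TAGS), dtype="float32")
    if token.pos_ in POS_TAGS:
        pos_onehot[POS_TAGS.index(token.pos_)] = 1.0
    return numpy.concatenate([token.vector, pos_onehot])

features = numpy.stack([token_features(t) for t in doc])
print(features.shape)  # (n_tokens, vector_dim + len(POS_TAGS))
```

In the actual config-driven model, the cleaner place for this would be the tok2vec embedding layer, which (as far as I know) can embed attributes like POS and DEP alongside the word form - provided a tagger and parser run earlier in the pipeline to set them.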

FYI - we're currently preparing a tutorial video on the REL example from the nightly docs that will explain the data structures & flow in the Thinc model in more detail. We hope this will help people dive into the specifics of the models and tune them to their use-case.
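
In the meantime, to give you a feel for the shapes involved: the model first builds one tensor per candidate entity pair out of the tok2vec output, and a small classification head then turns each pair tensor into one score per relation label. Here's a stripped-down, hypothetical stand-in for that head (the widths are made up):

```python
import numpy
from thinc.api import chain, Linear, Logistic

# Hypothetical stand-in for the REL classification head; the real project
# first constructs one tensor per candidate entity pair from the tok2vec.
n_labels = 3      # your three relation types
pair_width = 192  # e.g. two pooled 96-dim entity vectors, concatenated

classifier = chain(Linear(nO=n_labels, nI=pair_width), Logistic())
classifier.initialize()

pairs = numpy.zeros((5, pair_width), dtype="float32")  # 5 candidate pairs
probs = classifier.predict(pairs)
print(probs.shape)  # (5, 3): one independent score per pair per label
```

Because the final layer is a set of independent sigmoids rather than a softmax, one entity pair can carry more than one relation type at the same time.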

But for this specific use-case, I think my first advice would be to try and see whether you can't cast this as a dependency parsing challenge.
