Label design: one label or two?

Hi there,
I have a rather high-level question:
I am trying to train a German label for locations. In addition to the obvious words (like "Deutschland"), it should also catch words like "deutsche" or "deutscher" in contexts like "deutsche Firma" or "deutscher Herkunft". Since the latter occur in rather different contexts than the former, I wonder if it would be better to create an additional label (like "Location adjectives"). I worry that with everything in one label, it might get very hard for the model to make sense of it. In your experience, what would be the best approach?
(Btw: I currently trained the model with a single label, with rather disappointing results.)
Thank you very much!

This is an interesting question! In the OntoNotes 5 corpus (which spaCy's English models are trained on), this is solved by introducing the label NORP (nationalities, religious groups etc.). However, the main problem this solves is that in English, those expressions are typically capitalised (e.g. "a German company") – if they weren't, the label scheme likely wouldn't include a category for those entities at all.

NER is already trickier in German because we capitalise nouns and "capitalised word in the middle of a sentence" is a pretty weak indicator for a named entity (but a pretty strong indicator in English). So if you're trying to teach the model that the same label applies to nouns and adjectives with different capitalisation, it makes sense that it struggles.

So I think it definitely makes sense to try a label scheme with two labels for the different concepts and see how it goes.

Also, if you haven't done it already: create a quick rule-based baseline, e.g. using spaCy's Matcher, so you know how far you get with just word lists or patterns like {"LEMMA": "deutsch"} or {"TEXT": "...", "POS": "ADJ"} etc. Even if you're not planning on actually using that approach, you want to know what you're "competing" against :wink: If you get 90% accuracy on your data using Matcher rules, it's going to be much harder to beat than if your baseline is 60%.
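A minimal sketch of such a baseline, runnable on a blank German pipeline (no trained model needed). Since a blank pipeline has no lemmatizer, it uses a LOWER/REGEX pattern instead of the LEMMA pattern mentioned above; the pattern names and the regex are illustrative assumptions, not from the original post:

```python
import spacy
from spacy.matcher import Matcher

# Blank German pipeline: tokenizer only, no model download required
nlp = spacy.blank("de")
matcher = Matcher(nlp.vocab)

# Word list for the "classic" location names (illustrative)
matcher.add("LOC", [[{"LOWER": "deutschland"}]])
# Inflected adjective forms: "deutsche", "deutscher", "deutschen", ...
# (regex is an assumption standing in for {"LEMMA": "deutsch"})
matcher.add("LOC_ADJ", [[{"LOWER": {"REGEX": "^deutsch(e|er|es|en|em)$"}}]])

doc = nlp("Eine deutsche Firma mit Sitz in Deutschland.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```

Running the rules over your annotated data and scoring the matched spans against the gold labels gives you the baseline number to beat.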

You could also consider using some of the rules for the most obvious cases to augment the model's predictions and assign the most obvious labels that you know are always correct. For nationalities, I think you could get pretty good coverage with a list of patterns – where it gets trickier and where you'd benefit from a model is stuff like "eine stuttgarter/Stuttgarter Firma" etc. :sweat_smile:
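One way to combine rules with a model in spaCy is an EntityRuler component, which assigns entities from patterns and can sit alongside the statistical NER in the same pipeline. A sketch on a blank pipeline, with illustrative labels and patterns (not from the original post):

```python
import spacy

# Blank German pipeline with a rule-based entity component
nlp = spacy.blank("de")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # "Always correct" cases hard-coded as rules (illustrative)
    {"label": "LOC", "pattern": [{"LOWER": "deutschland"}]},
    {"label": "LOC_ADJ", "pattern": [{"LOWER": {"REGEX": "^deutsch(e|er|es|en|em)$"}}]},
])

doc = nlp("Eine deutsche Firma mit Sitz in Deutschland.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

In a real pipeline you would add the ruler before (or after) the trained "ner" component, so the obvious cases are locked in by rules and the model handles the rest, such as the "stuttgarter/Stuttgarter" cases.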


That's a great idea! I actually hadn't thought of that.

Perfect! I will definitely try that.

I am pretty sure this will occur a lot :laughing: But I have a feeling I could catch the largest part of them with rules as well.

Thank you again for the incredibly fast and perfect answer! This is a huge help.
