Prodigy Lemma support in Dutch - NER patterns

Hi there! I'm new to spaCy and Prodigy so bear with me :innocent:
So I'm not sure if it's a bug or not supported or something else but I'm running into the following issue.

If you take the following Dutch example, neither 'kijk' nor 'kijken' is pre-highlighted in Prodigy.

prodigy ner.manual example_data nl_core_news_md ./examples.jsonl --label VISUAL --patterns patterns.jsonl

Examples:
{"text":"Ik kijk uit naar je reactie"}
{"text":"Wij kijken uit naar je reactie"}

Pattern:
{"label": "VISUAL", "pattern": [{"lemma": "kijk"}]}

If I give this a try in English it does seem to work and the data is pre-highlighted:

prodigy ner.manual example_data en_core_web_md ./examples.jsonl --label VISUAL --patterns patterns.jsonl

Examples:
{"text":"I look forward to seeing you"}
{"text":"I'm looking forward to your response"}

Pattern:
{"label": "VISUAL", "pattern": [{"lemma": "look"}]}

But if I run the following Python script, it does seem to return a lemma for 'kijk', although the results look quite different from English:

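import spacy

# Assumed: loaded the same way as in the Matcher snippet further down
nlp = spacy.load("nl")
doc = nlp("Ik kijk uit naar je reactie")

# Print the lemma of every token in the first sentence
for word in next(doc.sents):
    print(word.lemma_)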

ik
kijken
uit
naar
je
reactie

If I run it against the matcher, it returns nothing (while again it does work in English):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("nl")
matcher = Matcher(nlp.vocab)

# Match any single token whose lemma is exactly "kijk"
pattern = [{"LEMMA": "kijk"}]
matcher.add("KIJK", None, pattern)

doc = nlp("Ik kijk uit naar je reactie")

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # look up the rule name for the hash
    span = doc[start:end]  # the matched tokens
    print(match_id, string_id, start, end, span.text)

# Returns nothing

# Adjusted to English it does work and returns:
# 16028893611387356987 KIJK 2 3 looking

So I'm a bit lost haha. Is the Dutch language supported for this?

I'm also looking for some best practices regarding training a single model vs. multiple models. For instance, if I want to label VISUAL and AUDITORY entities (e.g. "I'm seeing" and "I'm hearing") and the training data is the same, what are the pros and cons of using one model versus two separate models?

Thanks!

Hi! So if I read this correctly, the problem you're seeing is that a pattern isn't being picked up in Prodigy, even though the lemma matches the token's lemma when you process the same text directly in spaCy? Under the hood, Prodigy calls into spaCy's Matcher, so if spaCy produces a match, Prodigy should definitely detect the same match.

When you processed the text in Python, did you use the same model (nl_core_news_md) and the same spaCy version? It looks like you're calling nlp = spacy.load("nl"), which is typically the shortcut for the _sm model, so the model you're using here might be different? Also, in your Dutch matcher example, you're using [{"LEMMA": "kijk"}] – shouldn't that be [{"LEMMA": "kijken"}]? At least, that's the lemma produced by the lemmatizer for that sentence. So it makes sense that a pattern with kijk doesn't match.
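To make the mismatch concrete, here is a minimal sketch that needs neither spaCy nor a model. It assumes the lemmas are exactly the ones printed in your script output above, and it relies on the fact that a string value in a token pattern is compared against the token's attribute as an exact match. The helper name lemma_pattern is made up for illustration and is not part of spaCy or Prodigy.

```python
import json

# Lemmas copied from the script output earlier in this thread for
# "Ik kijk uit naar je reactie" (an assumption: your model may differ).
observed_lemmas = ["ik", "kijken", "uit", "naar", "je", "reactie"]


def lemma_pattern(label, lemma):
    """Hypothetical helper: build one patterns.jsonl entry matching a token by lemma."""
    return {"label": label, "pattern": [{"lemma": lemma}]}


for candidate in ("kijk", "kijken"):
    # A string value in a token pattern has to equal the token's lemma exactly.
    found = candidate in observed_lemmas
    print(candidate, "->", "would match" if found else "no token has this lemma")

# One JSON object per line, which is the shape patterns.jsonl expects.
print(json.dumps(lemma_pattern("VISUAL", "kijken")))
```

The last line prints an entry you could put in patterns.jsonl in place of the "kijk" one, provided the model you pass to ner.manual also produces "kijken" for these sentences.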

The Dutch lemmatization is based on the part-of-speech tags and lookup tables, so depending on the model version and the text, you may see different results. So it can also happen that the model gets something wrong and produces an incorrect tag or lemma. But in that case, you should see that when you inspect token.lemma_.