Hi there! I'm new to spaCy and Prodigy, so bear with me.
I'm not sure whether this is a bug, something that isn't supported, or something else, but I'm running into the following issue.
With the following Dutch example, neither 'kijk' nor 'kijken' is pre-highlighted in Prodigy:
prodigy ner.manual example_data nl_core_news_md ./examples.jsonl --label VISUAL --patterns patterns.jsonl
Examples:
{"text":"Ik kijk uit naar je reactie"}
{"text":"Wij kijken uit naar je reactie"}
Pattern:
{"label": "VISUAL", "pattern": [{"lemma": "kijk"}]}
If I try the same thing in English, it does seem to work and the data is pre-highlighted:
prodigy ner.manual example_data en_core_web_md ./examples.jsonl --label VISUAL --patterns patterns.jsonl
Examples:
{"text":"I look forward to seeing you"}
{"text":"I'm looking forward to your response"}
Pattern:
{"label": "VISUAL", "pattern": [{"lemma": "look"}]}
But if I run the following Python script, it does seem to return a lemma for 'kijk' (although the results look quite different from English):
import spacy
nlp = spacy.load("nl")  # same model shortcut as in the matcher script further down
doc = nlp("Ik kijk uit naar je reactie")
for word in next(doc.sents):
    print(word.lemma_)
ik
kijken
uit
naar
je
reactie
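To check my own understanding: if a lemma pattern is compared against exactly these lemma strings (that is an assumption on my part, I haven't confirmed it), then the comparison would behave roughly like the plain-Python sketch below. It does not use spaCy at all. The lemmas are copied from the output above, and `match_lemma` is a hypothetical helper I made up for illustration.

```python
# Lemmas copied by hand from the output printed above; spaCy is not used here.
lemmas = ["ik", "kijken", "uit", "naar", "je", "reactie"]


def match_lemma(lemmas, target):
    """Return (start, end) token spans whose lemma equals `target` exactly.

    Hypothetical helper: it assumes a single-token lemma pattern boils down
    to an exact string comparison, which is only my reading of the matcher.
    """
    return [(i, i + 1) for i, lemma in enumerate(lemmas) if lemma == target]


print(match_lemma(lemmas, "kijk"))    # []
print(match_lemma(lemmas, "kijken"))  # [(1, 2)]
```

Under that assumption, a pattern on "kijk" would find nothing in this sentence, while a pattern on "kijken" would hit the second token.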
If I run it against the matcher, it returns nothing (while again it does work in English):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("nl")
matcher = Matcher(nlp.vocab)
# Match any single token whose lemma is "kijk"
pattern = [{"LEMMA": "kijk"}]
matcher.add("KIJK", None, pattern)  # None = no on_match callback
doc = nlp("Ik kijk uit naar je reactie")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # look up the pattern name from its hash
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
# Returns nothing
# Adjusted to English it does work and returns:
# 16028893611387356987 KIJK 2 3 looking
So I'm a bit lost haha. Is the Dutch language supported for this?
I'm also looking for some best practices on training a single model versus multiple models. For instance, if I want to label both VISUAL and AUDITORY entities (e.g. "I'm seeing" and "I'm hearing") and the training data is the same, what are the pros and cons of using one model versus two separate models?
Thanks!