Using rule-based matching with Greek in spaCy

I have text in Greek and want to isolate phrases in capital letters in quotation marks.

For a reproducible example, the text is the following:

document = '''ΓΕΝΙΚΗ ΓΡΑΜΜΑΤΕΙΑ ΕΜΠΟΡΙΟΥ & ΠΡΟΣΤΑΣΙΑΣ ΚΑΤΑΝΑΛΩΤΗ ΓΕΝΙΚΗ Δ/ΝΣΗ ΑΓΟΡΑΣ Δ/ΝΣΗ ΕΤΑΙΡΕΙΩΝ ΤΜΗΜΑ ΕΠΟΠΤΕΙΑΣ ΕΙΣΗΓΜΕΝΩΝ Α.Ε. & ΑΘΛΗΤΙΚΩΝ Α.Ε. Ταχ. Δ/νση: Πλ. Κάνιγγος Ταχ. Κώδικας: 101 81 Πληροφορίες: Μ. Κανά Τηλέφωνο: 2103893566 Fax: 2103838981 e-mail: kana@gge.gr Αθήνα, 25.07.2019 Αρ. Πρωτ.: 1577914 ΑΝΑΚΟΙΝΩΣΗ Καταχώρισης στο Γενικό Εμπορικό Μητρώο στοιχείων της Ανώνυμης Εταιρείας με την επωνυμία «ΓΡΗΓΟΡΗΣ ΣΑΡΑΝΤΗΣ ΑΝΩΝΥΜΗ ΒΙΟΜΗΧΑΝΙΚΗ ΚΑΙ ΕΜΠΟΡΙΚΗ ΕΤΑΙΡΕΙΑ ΚΑΛΛΥΝΤΙΚΩΝ, ΕΝΔΥΜΑΤΩΝ, ΟΙΚΙΑΚΩΝ ΚΑΙ ΦΑΡΜΑΚΕΥΤΙΚΩΝ ΕΙΔΩΝ» Ανακοινώνεται ότι την 25 .07.2019 καταχωρίσθηκε στο Γενικό Εμπορικό Μητρώο (Γ.Ε.ΜΗ) με Κωδικό Αριθμό Καταχώρησης 1802156. η με αριθμό 780001/25.07.2019 απόφασή μας (ΑΔΑ: ΨΝΚ1465ΧΙ82ΒΜ), με την οποία εγκρίθηκε η τροποποίηση εν συνόλω του καταστατικού, της ανώνυμης εταιρείας με την επωνυμία «ΓΡΗΓΟΡΗΣ ΣΑΡΑΝΤΗΣ ΑΝΩΝΥΜΗ ΒΙΟΜΗΧΑΝΙΚΗ ΚΑΙ ΕΜΠΟΡΙΚΗ ΕΤΑΙΡΕΙΑ ΚΑΛΛΥΝΤΙΚΩΝ, ΕΝΔΥΜΑΤΩΝ, ΟΙΚΙΑΚΩΝ ΚΑΙ ΦΑΡΜΑΚΕΥΤΙΚΩΝ ΕΙΔΩΝ», και αριθμό ΓΕΜΗ 255201000 (πρώην ΑΡ. ΜΑΕ 13083/06/Β/86/27), σύμφωνα με την από 18-6-2019 απόφαση της Τακτικής Γενικής Συνέλευσης των μετόχων της, στο πλαίσιο της εναρμόνισης με το ν. 4548/2018 «Αναμόρφωση του δικαίου των ανωνύμων εταιρειών». Το εν λόγω καταστατικό με ημερομηνία 18/6/2019 αποτελείται από 28 άρθρα, ως αυτά διαλαμβάνονται στα κεφάλαια A έως H αυτού. Την ίδια ημερομηνία καταχωρίσθηκε στο Γενικό Εμπορικό Μητρώο ολόκληρο το νέο κείμενο καταστατικού μαζί με τις τροποποιήσεις του. Ο ΠΡΟΙΣΤΑΜΕΝΟΣ ΤΗΣ ΔΙΕΥΘΥΝΣΗΣ ΙΩΑΝΝΗΣ ΑΡΕΤΑΙΟΣ'''

My code is the following:

    import spacy
    from spacy.matcher import Matcher

    nlp_el = spacy.load('el_core_news_md')
    doc = nlp_el(document)
    matcher = Matcher(nlp_el.vocab)
    pattern = [{'POS': 'PUNCT', 'TEXT': '«'}, {'POS': 'NOUN', 'IS_UPPER': True, 'OP': '?'}, {'POS': 'PUNCT', 'TEXT': '»'}]
    # Add the pattern to the matcher and apply the matcher to the doc
    # (spaCy v2-style call; spaCy v3 expects matcher.add("UPPER CASE", [pattern]))
    matcher.add("UPPER CASE", None, pattern)
    matches = matcher(doc)
    print("Total matches found:", len(matches))

    # Iterate over the matches and print the span text
    for match_id, start, end in matches:
        print("Match found:", doc[start:end].text)

The output is the following, which is not what I aimed for:

[Image: screenshot of the output, not preserved]

I expected to isolate the following phrase:

«ΓΡΗΓΟΡΗΣ ΣΑΡΑΝΤΗΣ ΑΝΩΝΥΜΗ ΒΙΟΜΗΧΑΝΙΚΗ ΚΑΙ ΕΜΠΟΡΙΚΗ ΕΤΑΙΡΕΙΑ ΚΑΛΛΥΝΤΙΚΩΝ, ΕΝΔΥΜΑΤΩΝ, ΟΙΚΙΑΚΩΝ ΚΑΙ ΦΑΡΜΑΚΕΥΤΙΚΩΝ ΕΙΔΩΝ»

How do you explain this, and what would you propose to remedy it?

Hi! This forum is mostly focused on Prodigy, so we can't provide individual help with spaCy-only tasks.

I'm not sure what spaCy version you're using, but I can't reproduce the results you're seeing. If I run your code, I'm getting 0 matches, which is expected, because the pattern doesn't match. If your patterns don't produce the intended results, try inspecting the tokens and their attributes that you're trying to match on. In this case, {'POS':'NOUN', 'IS_UPPER':True} is too specific and won't match: not all uppercase tokens between those quotes are predicted as nouns. Similarly, punctuation characters are not considered uppercase (consistent with Python's isupper string method). So there's no sequence of tokens that your pattern applies to.
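The uppercase point can be checked in plain Python without loading a model. This is only a minimal sketch, assuming (as noted above) that the IS_UPPER flag is consistent with str.isupper; the sample tokens are copied from the quoted company name, and the variable names are made up for illustration:

```python
# Sample tokens copied from the quoted phrase in the question.
tokens = ["«", "ΓΡΗΓΟΡΗΣ", "ΚΑΛΛΥΝΤΙΚΩΝ", ",", "ΚΑΙ", "»"]

# str.isupper() is True only if the string contains at least one cased
# character and every cased character is uppercase. Punctuation has no
# cased characters, so it is never "upper".
upper_flags = {tok: tok.isupper() for tok in tokens}

for tok, flag in upper_flags.items():
    print(f"{tok!r:15} isupper={flag}")
```

So a pattern that requires IS_UPPER on every token between the quotes would stop at the first comma. Note also that the '?' operator in the original pattern makes the token optional, so it allows at most one token between the quotation marks.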

Thank you for the reply. I understand that this forum is aimed at people who use Prodigy, but I am planning to become a Prodigy user myself, in the context of a task I want to solve. If you can point me to another forum, apart from Stack Overflow, where I can discuss spaCy-related issues, I would appreciate it. I wonder why you cannot replicate my results. Anyway, your points are valid. But how would you write the query to isolate this text span? More generally, can I use regex syntax in spaCy to define patterns? (I have made a separate post with this question as well.)

Last but not least, I am getting very unreliable results with the Greek module, see below:

[Image: screenshot of the results, with errors highlighted in yellow, not preserved]

All the terms highlighted in yellow are errors. Can you advise me how I could deal with that? I understand that Stanford NLP also has a Greek module, and perhaps it is more reliable when it comes to POS (I have not tested it yet). But it lacks NER functionality. Is it possible to combine the two (Stanford NLP and spaCy)?

Thank you again.

I will need to support manual annotation with rule-based entity recognition to make it more effective. To that end, is there a way to use the full functionality of regular expressions when constructing entity patterns?

Okay, I've merged both of your topics into this thread to keep the discussion in one place. Stack Overflow is currently the best place to ask general usage questions about spaCy: it has the biggest community, and many questions have been answered there before. I also hope that the spaCy documentation is useful.

You can find details on the Greek model, including the data it was trained on, in the release details. It seems to be doing okay on that corpus, but how a model performs on your specific data is always a different question and depends on how similar it is to the training data. So you might need to fine-tune it.

Here are some resources that should answer your questions:

To solve the matching problem, maybe double-check the spaCy version you're running and use spacy validate to make sure the model is compatible. And I would avoid making the pattern too specific and only depend on token attributes that are absolutely necessary. Or use regular expressions, as described in the docs.
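As a rough illustration of the regular-expression route, here is a minimal sketch that uses only Python's re module on the raw text rather than the Matcher. The shortened sample string and all variable names are assumptions made for illustration. The idea is to capture everything between « and » and keep only the spans that are entirely uppercase:

```python
import re

# Shortened sample assembled from the text in the question (an assumption
# for brevity; the full document string would work the same way).
text = (
    "με την επωνυμία «ΓΡΗΓΟΡΗΣ ΣΑΡΑΝΤΗΣ ΑΝΩΝΥΜΗ ΒΙΟΜΗΧΑΝΙΚΗ ΚΑΙ ΕΜΠΟΡΙΚΗ "
    "ΕΤΑΙΡΕΙΑ ΚΑΛΛΥΝΤΙΚΩΝ, ΕΝΔΥΜΑΤΩΝ, ΟΙΚΙΑΚΩΝ ΚΑΙ ΦΑΡΜΑΚΕΥΤΙΚΩΝ ΕΙΔΩΝ» "
    "στο πλαίσιο της εναρμόνισης με το ν. 4548/2018 "
    "«Αναμόρφωση του δικαίου των ανωνύμων εταιρειών»."
)

# Anything between « and » that contains no further guillemets.
QUOTED = re.compile(r"«([^«»]+)»")

uppercase_phrases = []
for m in QUOTED.finditer(text):
    inner = m.group(1)
    # isupper() ignores spaces, commas and digits, so punctuation inside
    # the quotes does not break the check.
    if inner.isupper():
        uppercase_phrases.append((m.start(), m.end(), m.group(0)))

for start, end, phrase in uppercase_phrases:
    print(start, end, phrase)
```

Inside a spaCy pipeline, character offsets like these can be mapped back to tokens with doc.char_span(start, end), which returns None if the offsets do not line up with token boundaries.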

Thank you. I will follow these leads.