Create a PhraseMatcher in spaCy and use it to label data manually

Hi :blush:

I am a new user of spaCy and Prodigy, and I have a question (maybe it is a stupid one!).
I am going to follow the same steps that you did in this video:

Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning

However, for the first step, I have created my match patterns with spaCy using the PhraseMatcher, because I have a big database that contains exactly the expressions that I am trying to recognize in any text (the text is in French). It contains more than 50,000 expressions.

import json
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("fr_core_news_lg")
matcher = PhraseMatcher(nlp.vocab)

with open('medicaments_expression_database.json') as json_file:
    data = json.load(json_file)

sentences = data['sentences']
# Create one Doc per expression to use as a phrase pattern
patterns = [nlp.make_doc(text) for text in sentences]
matcher.add("medicaments_expressions", None, *patterns)  # spaCy v2 signature; v3 uses matcher.add(name, patterns)

Is there any way to use these patterns for the second step and start labeling data manually with their help, the same way you used food_patterns.jsonl in this command:
prodigy ner.manual food_data blank:en ./reddit_r_cooking_sample.jsonl --label INGRED --patterns food_patterns.jsonl

Hi, why don't you just create medicaments_patterns.jsonl directly from
medicaments_expression_database.json, so that you can use it as in the video?

Thanks @xia
I am not sure how I should split the content of the sentences. Maybe there is some common way between spaCy and Prodigy for text splitting!
I'm trying to take advantage of existing functions in both!
What do you think @ines?
Thanks for the support

Under the hood, Prodigy also uses spaCy's matchers and supports both token-based patterns (Matcher) and phrase patterns (PhraseMatcher). You can find examples of the data format here: https://prodi.gy/docs/api-loaders#input-patterns So you'd just have to convert your medicaments_expression_database.json file to a patterns file that looks like this:

{"label": "SOME_LABEL", "pattern": "expression here"}
{"label": "SOME_LABEL", "pattern": "other expression here"}
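If the database is a JSON file with a "sentences" list, as in the snippet above, the conversion is only a few lines. A sketch (the label and output file name are placeholders):

```python
import json

def to_pattern_lines(sentences, label="SOME_LABEL"):
    # One JSON object per line: the JSONL format Prodigy's --patterns option expects
    return [json.dumps({"label": label, "pattern": s}, ensure_ascii=False)
            for s in sentences]

# Example: convert two expressions and write a patterns file
lines = to_pattern_lines(["teflaf/usp 0,5mg/ml fli", "telmiar/hyd.bga 15µg/10ml"])
with open("medicaments_patterns.jsonl", "w", encoding="utf8") as out:
    out.write("\n".join(lines) + "\n")
```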

Are these actual sentences? Because that's not really a good fit for a named entity recognition model, which is designed to predict spans like proper nouns (e.g. names, places, products). So the patterns you feed in should contain examples of the entities, to help you pre-select them.


Thanks for your fast reply @ines
To be more precise, here are some examples of the "expressions" that I am looking to detect in my medical documents. They are a kind of drug composition.

acide acetyl gesdene/bga 15µg cpp
teflaf/usp 0,5mg/ml fli
telmiar/hyd.bga 15µg/10ml

so I thought that with the PhraseMatcher, I could get, for this expression,

acide acetyl gesdene/bga 15µg cpp

a pattern like :

{"label": "med_elements",  "pattern": [{"lower":"acide"},  {"lower":"acetyl"}, {"lower": "gesdene/bga"},{"lower": "15µg"},  {"lower": "cpp"}]}

I know I can create a script to generate these patterns from the medicaments_expression_database.json file, but why do that if it can be done with the PhraseMatcher?

The following would also work:

{"label": "med_elements",  "pattern": "acide acetyl gesdene/bga 15µg cpp"}

You might want to consider splitting the phrases into name and dose (is that what the measurements here are? Sorry, I'm not an expert in this domain :sweat_smile:). This is likely going to be easier for the model to predict than the whole expression.
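For example, with hypothetical MED_NAME and DOSE labels (just to illustrate the split):

```json
{"label": "MED_NAME", "pattern": "acide acetyl gesdene/bga"}
{"label": "DOSE", "pattern": "15µg"}
```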

You can also pre-label your data using the phrase matcher. The "spans" used by Prodigy include the label and start/end character. You can easily get that from the matcher:

spans = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    spans.append({"start": span.start_char, "end": span.end_char, "label": "med_elements"})

Just make sure you filter the spans so you don't end up with any overlaps (e.g. using spaCy's filter_spans utility).
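To make the overlap issue concrete, here's a pure-Python sketch of the greedy longest-first filtering that spacy.util.filter_spans applies, shown on plain (start, end, label) tuples so it runs without spaCy. In practice you'd just call filter_spans on the Span objects; this reimplementation is only for illustration:

```python
def filter_overlaps(spans):
    """Keep the longest span in each group of overlapping matches,
    mirroring the behavior of spacy.util.filter_spans."""
    result, seen = [], set()
    # Longest spans first; ties broken by start position
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(i in seen for i in range(start, end)):
            result.append((start, end, label))
            seen.update(range(start, end))
    return sorted(result)

matches = [(8, 20, "med_elements"), (8, 33, "med_elements"), (25, 33, "med_elements")]
print(filter_overlaps(matches))  # prints [(8, 33, 'med_elements')]
```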


Hi @ines
I have tried to create my own patterns from my database in order to use them with the ner.manual recipe.

Here are some examples of my patterns from the All_patterns.jsonl file:

{"label":"Med","pattern":[{"LIKE_NUM": "True"}, {"LOWER": {"REGEX":"(capsule[s]?)"}}]}
{"label":"Med","pattern":[{"LOWER": {"REGEX":"(comprime[s]?)"}}]}
{"label":"Med","pattern":[{"LOWER": "prendre"},{"LIKE_NUM": "True"}, {"LOWER": "gelule"}]}

At first, I got this error when I ran the ner.manual recipe:

Invalid JSON on line 1: {"label":"Med","pattern":[{"LIKE_NUM": True}, {"LOWER" : {"REGEX":"(capsule[s]?)"}}]} 

I replaced {"LIKE_NUM": True} with {"LIKE_NUM": "True"}, which I found weird because the first one worked with spaCy.
Anyway, I launched this command line again:

prodigy ner.manual annot_maggg fr_core_news_lg rawText.txt  --loader txt --label Med,Fre,Dur --patterns All_patterns.jsonl 

and none of the patterns matched any of the sentences in rawText.txt.

For example, in the rawText.txt file there is the expression (Prendre 1 gelule d'emblee) that can be matched completely by this pattern:
[{"LOWER": "prendre"}, {"LIKE_NUM": "True"}, {"LOWER": "gelule"}]
and unfortunately the system did not suggest it.

It left me disappointed!
Why does the same rule-based matching pattern that works in spaCy not work with Prodigy?

The problem here is that Python uses True and False and JSON uses true and false. Similar to how Python uses None and JSON uses null. If you're writing a pattern for spaCy, you're writing it in Python. If you're loading in patterns from a JSON file, you're writing them in JSON.
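A quick way to see the difference:

```python
import json

# Python's True becomes JSON's true when serializing ...
print(json.dumps({"LIKE_NUM": True}))    # prints {"LIKE_NUM": true}
# ... and JSON's true comes back as Python's True when parsing
print(json.loads('{"LIKE_NUM": true}'))  # prints {'LIKE_NUM': True}
```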

Under the hood, all Prodigy does is call into spaCy's Matcher. So if a pattern matches in spaCy, it will also match in Prodigy. By "the system did not suggest it", do you mean that you didn't see the example at all, or that you saw the example without a match?


Thank you @ines for your fast reaction.
I changed all the true and false values in the JSON file and it works :slight_smile:

I meant that I saw the example without a match, but now it works.

However, I got another error. What's wrong with this pattern?

[{"LOWER" : {"REGEX":"([\d]/[\d])"}}, {"LOWER" : {"REGEX":"(granule[s]?)"}}]

I had written this pattern for spaCy in Python, but Prodigy did not accept it:

> Invalid JSON on line 1: {"label":"Med","pattern":[{"LOWER" : {"REGEX":"([\d]/[\d])"}}, {"LOWER" : {"REGEX":"(granule[s]?)"}}]}

I think the problem is in [\d]

How did I create my JSON? I wrote my patterns in a Python script, then changed the extension from .py to .jsonl. As I know very well what I am looking for in my texts, I create my patterns manually.

Because it's JSON, I think you have to escape the \, otherwise it's interpreted as an escape character. So: [\\d].

If you have data in Python and want to generate JSON from it, a more convenient way is to use json.dumps. This will take care of all the Python to JSON conversion for you.
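For instance, writing the patterns above from Python (a sketch; the output file name is just an example), json.dumps takes care of both the true/false booleans and the backslash escaping:

```python
import json

patterns = [
    {"label": "Med", "pattern": [{"LOWER": {"REGEX": r"([\d]/[\d])"}},
                                 {"LOWER": {"REGEX": "(granule[s]?)"}}]},
    {"label": "Med", "pattern": [{"LOWER": "prendre"}, {"LIKE_NUM": True},
                                 {"LOWER": "gelule"}]},
]

with open("All_patterns.jsonl", "w", encoding="utf8") as out:
    for p in patterns:
        # json.dumps emits true (not True) and escapes \d as \\d in the output
        out.write(json.dumps(p) + "\n")
```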
