I'm using a spaCy config.cfg to train my model:
...
[nlp]
lang = "xx"
pipeline = ["ner","entity_ruler"]
disabled = []
...
[initialize.components.entity_ruler.patterns]
@readers = "srsly.read_jsonl.v1"
path = "my_patterns.jsonl"
skip = false
I wrote all the patterns to my_patterns.jsonl using srsly.write_jsonl. The output file has every True flag converted to true, which is correct for JSON but is not recognized by Python.
An example pattern:
{"label": "my_label", "pattern": [{"LOWER": {"IN": [list of keywords]}}, {"IS_SPACE": True, "OP": "?"}, {"IS_PUNCT": True, "OP": "?"}, {"IS_DIGIT": True}]}
srsly writes this as:
{"label": "my_label", "pattern": [{"LOWER": {"IN": [list of keywords]}}, {"IS_SPACE": true, "OP": "?"}, {"IS_PUNCT": true, "OP": "?"}, {"IS_DIGIT": true}]}
When I train the model, the label is not recognized.
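To check whether the True/true conversion itself loses anything, here is a minimal sketch of the round trip. It uses the standard json module instead of srsly, on the assumption (not verified here) that srsly.read_jsonl parses booleans the same way. The keyword list is a hypothetical placeholder for my real one.

```python
import json

# Hypothetical keywords standing in for the real list
keywords = ["keyword", "keywords"]

pattern = {
    "label": "my_label",
    "pattern": [
        {"LOWER": {"IN": keywords}},
        {"IS_SPACE": True, "OP": "?"},
        {"IS_PUNCT": True, "OP": "?"},
        {"IS_DIGIT": True},
    ],
}

# Serialize to a single JSONL line: Python True is written as JSON true
line = json.dumps(pattern)

# Parse the line back: JSON true is read as Python True again
restored = json.loads(line)

print(line)
print(restored == pattern)
```

If the restored dict compares equal to the original, the lowercase true in the file is only a serialization detail, which would suggest the missing label has some other cause.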
On the other hand, I tried a simple script:
import spacy
# Import the Matcher
from spacy.matcher import Matcher
# Load a model and create the nlp object
nlp = spacy.load("xx_ent_wiki_sm")
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
pattern = [{"LOWER": {"IN": [list of keywords]}},{"IS_SPACE": True, "OP":"?"}, {"IS_PUNCT": True, "OP": "?"}, {"IS_DIGIT": True}]
matcher.add("my_label", [pattern])
# Process some text
doc = nlp("Hello, this is a keyword from the keyword list")
# Call the matcher on the doc
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
This code works and detects the keyword in the doc. But when I train the model using config.cfg and my_patterns.jsonl, it does not work. How can I fix this?