Dear Prodigy community,
Greetings! I am new to the Prodigy tool. I am trying to create a model that will identify personal information such as bank account number, passport number, credit card, car plate number, email, phone number, and social security number (NRIC). These follow the local country format (Singapore); for example, a passport number starts with K, e.g. K1234567P.
Source data: around 3000+ text files in TextGrid format, translated from live conversations (each 20-30 KB, around 100 lines). After some data cleaning, I extracted the text chunks from each file into a JSONL file. Last week I tried a sample of 200 lines. A sample JSON line looks like:
{"text":"Adrian often used his credit card 8892-1533-2466-0909 to book stays, and Elara coordinated with contacts using her contact 83836890"}
The objective is to create an empty model from scratch to label Personally Identifiable Information (PII), so that the output is labelled like:
Adrian often used his credit card 8892-1533-2466-0909 <CREDIT_CARD> ..... her contact 83836890 <PHONE>
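As a sanity check of that target format, a tiny stdlib-only sketch that appends the label after each entity (the entity list below is hand-written for the sample sentence, not real model output):

```python
text = ("Adrian often used his credit card 8892-1533-2466-0909 to book stays, "
        "and Elara coordinated with contacts using her contact 83836890")

# Hand-written (entity surface, label) pairs standing in for model predictions
entities = [("8892-1533-2466-0909", "CREDIT_CARD"), ("83836890", "PHONE")]

out = text
for surface, label in entities:
    out = out.replace(surface, f"{surface} <{label}>")
print(out)
```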
Annotation command:
python3 -m prodigy ner.manual ner_sample blank:en ./focus_input.jsonl --label CREDIT_CARD,PHONE,EMAIL,NRIC --patterns ./entity_patterns.jsonl
Training command:
python3 -m prodigy train ./models_new --ner ner_sample --lang en --gpu-id 0
My doubts are:

- Since this is a custom model, is it better to start with blank:en, or with en_core_web_sm as the baseline model for tokenization? I ask because later I may wish to add a PERSON label as well.
- On either model, I plan to use entity_patterns.jsonl to enforce the patterns:

{"label":"PHONE","pattern":[{"TEXT":{"REGEX":"\\d{8}"}}]}
{"label":"CREDIT_CARD","pattern":[{"TEXT":{"REGEX":"\\d{4}"}},{"TEXT":"-"},{"TEXT":{"REGEX":"\\d{4}"}},{"TEXT":"-"},{"TEXT":{"REGEX":"\\d{4}"}},{"TEXT":"-"},{"TEXT":{"REGEX":"\\d{4}"}}]}
If I understand correctly, this makes the annotation faster, and later the model will identify the target labels more accurately, whether on blank:en or en_core_web_sm. Right?
Without entity_patterns, there is a known issue with custom entities, as mentioned here. Link: Spacy matcher issue
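Since token-level REGEX patterns can be fiddly, it may help to first check the underlying regexes against the sample sentence with plain `re`. A stdlib-only sketch (the anchored variants and the NRIC/passport pattern below are my assumptions, not the exact patterns from entity_patterns.jsonl; note that spaCy's REGEX operator uses re.search on each token's text, so an unanchored \d{8} would also match inside longer digit runs):

```python
import re

text = ("Adrian often used his credit card 8892-1533-2466-0909 to book stays, "
        "and Elara coordinated with contacts using her contact 83836890")

# Whole-string regexes mirroring the token patterns, tightened to avoid
# matching inside longer digit sequences
regexes = {
    "PHONE": r"(?<!\d)\d{8}(?!\d)",             # exactly 8 digits
    "CREDIT_CARD": r"\d{4}-\d{4}-\d{4}-\d{4}",  # four groups of 4 digits
    "NRIC": r"\b[STFGK]\d{7}[A-Z]\b",           # assumed SG NRIC/passport shape
}

for label, rx in regexes.items():
    for m in re.finditer(rx, text):
        print(label, m.group())
```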
As said earlier, 200 lines of data seems too little, and the model's accuracy (predicting the correct label) is not good. So I have to make a decision before scaling up to 3000-4000 lines of text. Please advise.
Cheers!
Chandra