NER - basic model doubt

Dear Prodigy community,

Greetings! I'm new to the Prodigy tool. I am trying to create a model that will identify personal information like bank account number, passport number, credit card, car plate number, email, phone number, and social security number (NRIC). These are in the local country format (Singapore). For example, a passport number starts with K, e.g. K1234567P.

Source data: around 3000+ text files in TextGrid format, translated from live conversations (each around 20-30 KB, about 100 lines). After some data cleaning, I extracted the "text chunks" from each file and produced a JSONL file. Last week I tried a sample of 200 lines of data. A sample JSON line looks like

{"text":"Adrian often used his credit card  8892-1533-2466-0909 to book stays, and Elara coordinated with contacts using her contact 83836890"}

The objective is to create an empty model from scratch that labels Personally Identifiable Information (PII), so the output is labelled like

Adrian often used his credit card 8892-1533-2466-0909 <CREDIT_CARD> ..... her contact 83836890 <PHONE>

python3 -m prodigy ner.manual ner_sample blank:en ./focus_input.jsonl --label CREDIT_CARD,PHONE,EMAIL,NRIC --patterns ./entity_patterns.jsonl

Training command:

python3 -m prodigy train ./models_new --ner ner_sample --lang en --gpu-id 0

My doubts are:

  1. Since this is a custom model, is it better to start with blank:en or with en_core_web_sm as the baseline model for tokenization? Later I may wish to add a PERSON label as well.

  2. With either model, I plan to use entity_patterns.jsonl to enforce the patterns:

{"label":"PHONE","pattern":[{"text":{"REGEX":"\\d{8}"}}]}`

{"label":"CREDIT_CARD","pattern":[{"text":{"REGEX":"\\d{4}"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": "\\d{4}"}}, {"TEXT": "-"}, {"TEXT": {"REGEX": "\\d{4}"}},{"TEXT": "-"}, {"TEXT": {"REGEX": "\\d{4}"}}]}

If I understand correctly, this makes the annotation faster, and later the model will identify the target labels more accurately, with either blank:en or en_core_web_sm. Right?
Without entity_patterns there is an issue for custom entities, as mentioned here: Spacy matcher issue
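
For reference, a quick way to sanity-check these token patterns before annotating is to load them into spaCy's Matcher (a minimal sketch; the example sentence is the one from above and blank:en tokenization is assumed):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# same token patterns as in entity_patterns.jsonl
# note: REGEX matches anywhere in the token text, so anchoring with ^ and $ would make the patterns stricter
matcher.add("PHONE", [[{"TEXT": {"REGEX": r"\d{8}"}}]])
matcher.add("CREDIT_CARD", [[
    {"TEXT": {"REGEX": r"\d{4}"}}, {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}, {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}}, {"TEXT": "-"},
    {"TEXT": {"REGEX": r"\d{4}"}},
]])

doc = nlp("Adrian often used his credit card 8892-1533-2466-0909 to book stays, "
          "and Elara coordinated with contacts using her contact 83836890")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
# expected: CREDIT_CARD -> 8892-1533-2466-0909 and PHONE -> 83836890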

As mentioned earlier, 200 lines of data seems too little and the model accuracy (predicting the correct label) is not good. So I have to take a decision before scaling up to 3000-4000 lines of text. Please advise.

Cheers!
Chandra

Welcome to the forum @e101sg :wave:

Since this is a custom model, is it better to start with blank:en or with en_core_web_sm as the baseline model for tokenization? Later I may wish to add a PERSON label as well.

The answer to the question of whether it's better to fine-tune or start from scratch typically depends on two criteria:

  1. The similarity between the data that was used to train the pre-trained model and your target data. If the data is similar in terms of vocabulary, style, register and punctuation, then fine-tuning makes sense. Otherwise, it probably won't help much.

  2. The overlap between the categories available in the pre-trained model and your target label set. If you are after a different set of labels, then again, fine-tuning won't help much.

In your case it looks like you are looking for labels different from those available in spaCy's pretrained pipelines, so it probably makes sense to start from scratch, even though it means that you'll have to annotate a fair amount of data.

If you want to add PERSON later on (which is included in spaCy's English models), there's nothing that stops you from using a pre-trained model to speed up the annotation of this particular label in a separate annotation pass and then retraining.

In any case, since annotation and training are separate steps, once you have annotated your dataset, you could always experiment with both types of training (from scratch and fine-tuning) and see what performs better for your use case.

About the choice of model for fine-tuning:
To find out which dataset was used for training and which labels are available, you can consult the spaCy docs here. In any case, it is recommended to use a more powerful model than en_core_web_sm for fine-tuning, e.g. en_core_web_md or en_core_web_lg.
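
For example, you can check which components and NER labels a pre-trained pipeline ships with before deciding (a minimal sketch, assuming en_core_web_md is installed):

import spacy

nlp = spacy.load("en_core_web_md")       # assumes the package has been downloaded
print(nlp.pipe_names)                    # e.g. [..., 'parser', ..., 'ner']
print(nlp.get_pipe("ner").labels)        # PERSON, ORG, GPE, ... but no CREDIT_CARD / PHONE / NRIC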

In summary, for your use case it's probably best to start from scratch and use patterns and possibly pre-trained pipelines to speed up the annotation.

On to your second question:

If I understand correctly, this makes the annotation faster, and later the model will identify the target labels more accurately, with either blank:en or en_core_web_sm. Right?

In the ner.manual workflow the patterns are used to speed up annotation by pre-highlighting the spans. That's correct.
In the ner.teach workflow, the predictions from the model and the patterns are combined, and both models are updated based on the annotator's feedback.

I'm not sure what you mean by:

later the model will identify the target labels more accurately, with either blank:en or en_core_web_sm

The patterns are used to pre-highlight spans; they do not influence the model's inference directly. In the active learning workflows (such as ner.teach) the final annotations can be influenced by patterns, so the patterns would have an effect on the model via the annotations, but not directly.


Got your points, thanks for the quick reply, Magda.

I mean that if we annotate with entity_patterns, the model will be more accurate whatever the baseline model, even though annotation (data cleaning) and training are two different processes.
I have trained the model on around 380 lines of data and got the following training output. A naive question: is ENTS_F the F1 score, which is 86.75 as seen in the screenshot? And if I'm not mistaken, ENTS_P = precision, right? Then what does the last column SCORE mean? Sorry, it may be a naive question, but I wish to learn. Cheers!

@magdaaniol

Got a related doubt regarding ner.teach. Is this the correct way:

prodigy ner.teach your_New_dataset_name  your_Previous_model  /path/to/New_data.jsonl --label LABEL1,LABEL2

And then use your_New_dataset_name again in the prodigy train command. Is that correct?
Any suggestions welcome.

Cheers!
Chandra

Hi @e101sg,

Re your first question about the metrics (no need to be sorry at all!):

ENTS_P is the precision of the NER component, that's right.
For the explanation of SCORE and the other columns, I'll just refer you to this post from Ines:
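
As a side note, ENTS_F is simply the harmonic mean (F1) of ENTS_P and ENTS_R, all on the 0-100 scale used in the training table (a worked example with made-up numbers, not your actual output):

ents_p = 88.0                                     # precision (hypothetical)
ents_r = 85.5                                     # recall (hypothetical)
ents_f = 2 * ents_p * ents_r / (ents_p + ents_r)
print(round(ents_f, 2))                           # 86.73 -- this is what ENTS_F reports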

Re ner.teach command question:

The way you're planning to use it is correct. After annotating with the model in the loop, it is recommended to retrain the model on the entire dataset resulting from the ner.teach session. So yes, you'd be using your_New_dataset_name in train. Our docs on active learning have some more details on why this is the recommended way:

When you annotate with a model in the loop, the model is also updated in the background. So why do you still need to train your model on the annotations afterwards, and can’t just export the model that was updated in the loop? The main reason is that the model in the loop is only updated once each new annotation. This is never going to be as effective as batch training a model on the whole dataset, making multiple passes over the data, shuffling on each epoch and using other deep learning tricks like dropout rates, compounding batch sizes and so on. If you batch train your model with the collected annotations afterwards, you should receive the same model you had in the loop, just better.


Thanks a lot for the reply, it is very useful. I got the answers... Before closing this thread, one more doubt regarding the scores for the NER component: is it possible to get the scores (F1, precision, accuracy) as shown in this screenshot taken from here?

https://support.prodi.gy/t/how-to-test-my-model-on-new-dataset/2827

  1. So, in my case, I would like to know the trained model's accuracy for labels like <CREDIT_CARD>.
  2. Also, is it possible to pass the evaluation dataset during training itself, as mentioned here? I did not understand the full idea or how to do it (Ines replied to one of the forum questions).

If I'm not mistaken, it is like this:
python3 -m prodigy train ./models_sunday --ner ner_sample9 --eval-id Eval_dataset.jsonl --lang en --gpu-id 0

Please advise.

Cheers!
Chandra

Hi @e101sg ,

Re 1)
It sure is possible to get the metrics per label. You just need to call the train command with the --label-stats flag.
You can consult all the other available options here: Built-in Recipes · Prodigy · An annotation tool for AI, Machine Learning & NLP

Re 2)
The quoted post is actually a bit outdated. The docs would be the go-to reference here. You pass the eval dataset just like any other dataset, only with the prefix eval: . In your example that would be:

python3  -m prodigy train ./models_sunday --ner ner_sample9,eval:Eval_dataset --lang en --gpu-id 0

Please note that the Eval_dataset should exist in the DB, so you would have to enter it prior to calling train. You can introduce the dataset from a JSONL file like so:

python3 -m prodigy db-in Eval_dataset Eval_dataset.jsonl
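
If you want to double-check that the dataset actually made it into the DB before training, you can also use Prodigy's Python database API (a minimal sketch, assuming the default database settings):

from prodigy.components.db import connect

db = connect()                               # connects to the DB configured for Prodigy
print("Eval_dataset" in db.datasets)         # True once db-in has run
print(len(db.get_dataset("Eval_dataset")))   # number of stored examples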

Hi @magdaaniol,
Noted the points. Regarding evaluating the test dataset with

python3 -m prodigy train ./models_sunday --ner ner_sample9,eval:Eval_dataset.jsonl

I'm getting the error `Can't find 'Eval_dataset.jsonl' in database 'SQLite'`, although Eval_dataset.jsonl is in the local folder.
I have also tried to import the 'Eval_dataset.jsonl' file into the Prodigy database (here ner_sample9):

prodigy db-in your_dataset Eval_dataset.jsonl

Any thoughts? Still struggling to find the correct way to evaluate the metrics on a test dataset. Thanks.

Hi @e101sg ,

You're right, what follows the eval: prefix on the command line should be the name of a dataset that exists in the DB. There was a typo in my previous message, which is now corrected.

So once you've correctly introduced your dataset to the DB with:

prodigy db-in your_dataset Eval_dataset.jsonl

assuming the name your_dataset, you'd need to update the train command to use the actual name of the dataset, so:

python3 -m prodigy train ./models_sunday --ner ner_sample9,eval:your_dataset --lang en --gpu-id 0
  1. Okay, let me confirm again. My evaluation data file is focus_input8.jsonl:

python3 -m prodigy db-in focus_input8 focus_input8.jsonl

:heavy_check_mark: Imported 240 annotated examples and saved them to 'focus_input8'
(session 2024-02-14_16-23-54) in database SQLite
Found and keeping existing "answer" in 0 examples
Later, when I check with prodigy db-out focus_input8,
all imported data has accept answers, and 0 for reject & ignore, which is incorrect.
Prodigy Ver 1.14.12
Python Ver 3.10.12
spaCy Ver 3.7.2

  2. Then the train command gives strange results:
python3 -m prodigy train --ner ner_final2,eval:focus_input8 ./ner_final2_600_with_test_data --gpu-id 0 --label-stats

  3. Also, I wonder whether there is any spaCy recipe/command to evaluate the Prodigy model (after converting to a certain spaCy format, right?).
    Appreciate your help. I hope this thread will be useful to beginners in the future.

Thanks & cheers!
Chandra

Hi @e101sg,

The db-in command checks if the input dataset has the answer field; if not, it adds this field with an accept value by default. You can control this behavior, e.g. say which default value it should use, via the --answer argument as explained in the docs here: Built-in Recipes · Prodigy · An annotation tool for AI, Machine Learning & NLP. Also see the "important note" there.

When it says "Found and keeping existing "answer" in 0 examples", it means that not a single example had the answer field and all of them were added by Prodigy. This is why, when you inspect the dataset after running db-in, it will have the answer accept. If you inspect the first example in your input to db-in, i.e. focus_input8.jsonl, it will not have any answer fields.
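
A quick way to confirm that is to peek at the raw input file before running db-in (a minimal sketch using the file name from your command):

import json

with open("focus_input8.jsonl", encoding="utf8") as f:
    first = json.loads(f.readline())
print("answer" in first)   # False here, which is why db-in fills in answer="accept" by default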

Then the train command gives strange results.

I imagine you refer to the stats on the numbers of training and eval examples.
Could it be that the focus_input8 dataset already had some examples in it? You can access descriptive stats on a dataset by running:

python -m prodigy stats {name of the dataset} 

that should tell you how many examples it has.

Then, when creating the final datasets for training, Prodigy will remove all duplicates and conflicting annotations (e.g. by preferring manual over binary annotations if that's the conflict). When you see differences between the raw dataset counts (in the output of prodigy stats or the "from datasets" log line) and the actual dataset counts that make it into training, it's probably best to inspect your dataset for duplicates and conflicting annotations.
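
One way to spot duplicates is to export the dataset and count repeated input texts (a minimal sketch; the export file name is just an example):

import json
from collections import Counter

# created with: prodigy db-out focus_input8 > focus_input8_export.jsonl
with open("focus_input8_export.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

counts = Counter(eg["text"] for eg in examples)
duplicates = {text: n for text, n in counts.items() if n > 1}
print(f"{len(examples)} examples, {len(duplicates)} duplicated texts")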

Also, I wonder whether there is any spaCy recipe/command to evaluate the Prodigy model (after converting to a certain spaCy format, right?)

The model produced by prodigy train is directly loadable in spaCy, and spaCy does have utilities for model evaluation. You can find the docs here: Command Line Interface · spaCy API Documentation
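
As a quick illustration that the trained artifact is a regular spaCy pipeline (a minimal sketch; the path assumes the output directory from your earlier train command, where Prodigy saves model-best and model-last):

import spacy

nlp = spacy.load("./ner_final2_600_with_test_data/model-best")
doc = nlp("Adrian paid with 8892-1533-2466-0909 and can be reached at 83836890")
for ent in doc.ents:
    print(ent.text, ent.label_)
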
On that note, I'm happy to share that we are finalizing the implementation of an evaluation command for Prodigy that should be released sometime next week :slight_smile:


Dear all and @magdaaniol,
Greetings! I edited and am re-posting my reply.

If you inspect the first example in your input to db-in, i.e. focus_input8.jsonl, it will not have any answer fields.

Yes, there are no answer fields. So the message "Found and keeping existing "answer" in 0 examples" is not related to model accuracy; I wrongly assumed it was. Fine.

Here we assume the evaluation dataset (which is not part of the training/test split) is annotated and added to the database, i.e. focus_input8.jsonl is annotated, saved into the database as focus_input8, and used with eval: in the prodigy train recipe. Right?

  1. I wonder whether it is possible to use focus_input8.jsonl directly with the eval argument.
  2. Since the annotated dataset must be saved in the database with the --answer argument, I purposely rejected all the suggestions in Prodigy for focus_input8.jsonl, saved it as focus_input8 and used it in prodigy train (to achieve point 1). :innocent: Anyway, looking forward to the new evaluation command for Prodigy.

  3. Meanwhile, I have found the correct spaCy commands: I used prodigy data-to-spacy, and once the .spacy binary dataset was created, I was able to train with spacy train using the default config.cfg and to run spacy benchmark accuracy as well. The spaCy model accuracy is almost the same as Prodigy's.
Finally some joy. :grinning:
Cheers!
e101 / Chandra

Hi @e101sg,

Nope, these have to be names of the datasets in the DB.

Re 2: If your JSONL file doesn't have the answer field, the db-in command will add the default answer of your choice. Not sure what the point of rejecting all the examples was :thinking: . The rejected examples, logically, won't be used in training; they will be filtered out by the train command.

In any case - glad to hear your training was successful :tada:


Hi Magda,

Thanks for your efforts. This thread got too long; I hope it is useful for new users. :blush:
Regarding your question, I wish to do this as posted in this thread.
New thread

Cheers!
Chandra /e101sg
