Hi!!
I am trying to train an NER model with a new entity type called "finance".
The language is Swedish and the model I am using contains:
- An NER component with 2 entities: PERSON and ORGANIZATION.
- Tagger
- Vectors
The model has been tested in spaCy and works quite well for the entities mentioned above.
Now I want to use Prodigy to train my new category "finance". These are my steps:
- I loaded a list of financial terms in Swedish into a new dataset called "wordlist_ner_swedish_finance".
prodigy db-in wordlist_ner_swedish_finance file.txt
- I converted the new dataset into a patterns file using the ner.to-pattern recipe (there are around 150 terms). The result is as follows:
{"label":"finance","pattern":[{"lower":"Villkor"}]}
{"label":"finance","pattern":[{"lower":"Villkoren"}]}
{"label":"finance","pattern":[{"lower":"Villkors\u00e4ndring"}]}
{"label":"finance","pattern":[{"lower":"Villkors\u00e4ndringen"}]}
{"label":"finance","pattern":[{"lower":"Villkors\u00e4ndringar"}]}
{"label":"finance","pattern":[{"lower":"L\u00e5nevillkor"}]}
{"label":"finance","pattern":[{"lower":"L\u00e5nevillkoren"}]}
{"label":"finance","pattern":[{"lower":"Kontantinsats"}]}
{"label":"finance","pattern":[{"lower":"Kontantinsatsen"}]}
...
...
...
- I created an empty dataset named "annotations_ner_swedish_finance" and started annotating with ner.teach, using a 30 MB corpus that contains some of the pattern terms.
prodigy ner.teach annotations_ner_swedish_finance my_model_folder corpus.txt --patterns finance-term.jsonl
The problem is that the suggestions shown in the annotation web app have nothing to do with the provided patterns. They are proper names or other words that generally start with a capital letter, but nothing close to the patterns. After 100 suggestions I could not accept a single one. Do you know what could be wrong?
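For reference, here is a minimal sketch of how a token-based patterns file like the one in the second step could be written by hand. The helper name `terms_to_patterns` is hypothetical, and splitting on whitespace is only an assumption standing in for the real tokenizer. The values are lowercased because spaCy compares the `lower` attribute against the lowercased token text, so a value with a capital letter can never match. Whether that explains the irrelevant suggestions here is a guess, not a confirmed diagnosis.

```python
import json


def terms_to_patterns(terms, label="finance"):
    """Build one token-based pattern per term (hypothetical helper)."""
    patterns = []
    for term in terms:
        # One {"lower": ...} dict per whitespace-separated token.
        # Assumption: whitespace splitting approximates the tokenizer.
        # The value is lowercased because "lower" is matched against
        # the lowercased form of each token.
        tokens = [{"lower": tok.lower()} for tok in term.split()]
        if tokens:
            patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize the patterns as JSONL, one pattern per line."""
    return "\n".join(json.dumps(p) for p in patterns)


if __name__ == "__main__":
    # A few of the terms from the word list, used as sample input.
    sample_terms = ["Villkor", "Villkorsändring", "Lånevillkor", "Kontantinsats"]
    print(to_jsonl(terms_to_patterns(sample_terms)))
```

The printed lines can be saved to a .jsonl file and passed to --patterns in place of the original file.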
I have also tried a different structure for the JSONL patterns file, as follows:
{"label": "finance", "pattern": "Villkor"}
{"label": "finance", "pattern": "Villkoren"}
{"label": "finance", "pattern": "Villkorsändring"}
{"label": "finance", "pattern": "Villkorsändringen"}
{"label": "finance", "pattern": "Villkorsändringar"}
{"label": "finance", "pattern": "Lånevillkor"}
{"label": "finance", "pattern": "Lånevillkoren"}
{"label": "finance", "pattern": "Kontantinsats"}
{"label": "finance", "pattern": "Kontantinsatsen"}
{"label": "finance", "pattern": "Låneansökan"}
{"label": "finance", "pattern": "Lånelöfte"}
{"label": "finance", "pattern": "Lånelöftet"}
{"label": "finance", "pattern": "Bostadslån"}
{"label": "finance", "pattern": "Bostadslånet"}
{"label": "finance", "pattern": "Bostadslånen"}
{"label": "finance", "pattern": "Ränterabatt"}
...
...
The issue is still the same.
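To rule out the corpus itself, a quick check of how often the terms actually occur in the text can be sketched as below. This is only a rough sanity check under some assumptions: the function name `count_term_hits` is hypothetical, it handles single-word terms only, and it uses a simple regex word split rather than the spaCy tokenizer. The sample sentences in the demo are made up for illustration.

```python
import re
from collections import Counter


def count_term_hits(lines, terms):
    """Count case-insensitive whole-word occurrences of each term."""
    wanted = {t.lower() for t in terms}
    hits = Counter()
    for line in lines:
        # \w+ is Unicode-aware for str in Python 3, so å, ä and ö
        # are treated as word characters.
        for word in re.findall(r"\w+", line.lower()):
            if word in wanted:
                hits[word] += 1
    return hits


if __name__ == "__main__":
    # Made-up sample lines; in practice, iterate over the corpus file
    # opened with encoding="utf8".
    sample_lines = [
        "Villkoren för bostadslånet har ändrats.",
        "Kontantinsatsen måste betalas i förväg.",
    ]
    sample_terms = ["Villkoren", "Bostadslånet", "Kontantinsatsen", "Ränterabatt"]
    for term, count in count_term_hits(sample_lines, sample_terms).items():
        print(term, count)
```

If most terms show zero hits, the pattern matcher has very little to suggest no matter which pattern format is used.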
Moreover, I tried to adopt the strategy proposed at the end of this post:
1. I initialized annotations_ner_swedish_finance by adding all the terms in the list as "finance". This is the output of the ner.print-dataset command:
# prodigy ner.print-dataset annotations_ner_swedish_finance
0.00 Villkor finance
0.00 Villkoren finance
0.00 Villkorsändring finance
0.00 Villkorsändringen finance
0.00 Villkorsändringar finance
0.00 Lånevillkor finance
...
2. I ran ner.batch-train, but without any success. I guess this is because I don't have enough data to train on (only 150 words):
# prodigy ner.batch-train annotations_ner_swedish_finance my-Model/ --output finance_bootstrap_model
Loaded model Final-Model/
Using 50% of accept/reject examples (71) for evaluation
Using 100% of remaining examples (72) for training
Dropout: 0.2  Batch size: 4  Iterations: 10
BEFORE     0.000
Correct    0
Incorrect  4
Entities   9
Unknown    7
#    LOSS    RIGHT  WRONG  ENTS  SKIP  ACCURACY
01   9.801   0      2      70    0     0.000
02   6.624   0      2      11    0     0.000
03   4.754   0      2      44    0     0.000
04   3.871   0      2      20    0     0.000
05   3.090   0      2      27    0     0.000
06   2.776   0      2      30    0     0.000
07   2.633   0      2      34    0     0.000
08   2.156   0      2      27    0     0.000
09   2.402   1      1      28    0     0.500
10   2.661   1      1      50    0     0.500
Correct    1
Incorrect  1
Baseline   0.000
Accuracy   0.500
Model: /prodigy/data/finance_bootstrap_model
Training data: /prodigy/data/finance_bootstrap_model/training.jsonl
Evaluation data: /prodigy/data/finance_bootstrap_model/evaluation.jsonl
3. I tried to re-run ner.teach, but the results are still the same.
Do you have any suggestions?