ner.batch-train does not suggest any match based on the provided pattern file

Hi!!
I am trying to train a NER with a new entity type called “finance”.
The language is Swedish and the model I am using contains:

  • A NER with 2 entities: PERSON and ORGANIZATION.
  • Tagger
  • Vectors

The model has been tested in spaCy and works quite well for the above-mentioned entities.
Now I want to use Prodigy to train my new category “finance”. These are my steps:

  1. I loaded a list of financial terms in Swedish into a new dataset called “wordlist_ner_swedish_finance”.

    prodigy db-in wordlist_ner_swedish_finance file.txt 
    
  2. I converted the new dataset into a pattern.json file using the terms.to-patterns recipe (around 150 terms). The result is as follows:

{"label":"finance","pattern":[{"lower":"Villkor"}]}
{"label":"finance","pattern":[{"lower":"Villkoren"}]}
{"label":"finance","pattern":[{"lower":"Villkors\u00e4ndring"}]}
{"label":"finance","pattern":[{"lower":"Villkors\u00e4ndringen"}]}
{"label":"finance","pattern":[{"lower":"Villkors\u00e4ndringar"}]}
{"label":"finance","pattern":[{"lower":"L\u00e5nevillkor"}]}
{"label":"finance","pattern":[{"lower":"L\u00e5nevillkoren"}]}
{"label":"finance","pattern":[{"lower":"Kontantinsats"}]}
{"label":"finance","pattern":[{"lower":"Kontantinsatsen"}]}


  3. I created an empty dataset named “annotations_ner_swedish_finance” and started ner.teach using a 30 MB corpus that contains some of the pattern terms.

    prodigy ner.teach annotations_ner_swedish_finance my_model_folder corpus.txt --patterns finance-term.jsonl
    

The problem is that the suggestions shown in the annotation web app are not at all relevant to the provided patterns. They include proper names and other words that generally start with a capital letter, but nothing close to the patterns. After 100 suggestions I had not been able to accept a single annotation. Do you know what could be wrong?

I have also tried a different structure for the JSONL patterns file, as follows:

{"label": "finance", "pattern": "Villkor"}
{"label": "finance", "pattern": "Villkoren"}
{"label": "finance", "pattern": "Villkorsändring"}
{"label": "finance", "pattern": "Villkorsändringen"}
{"label": "finance", "pattern": "Villkorsändringar"}
{"label": "finance", "pattern": "Lånevillkor"}
{"label": "finance", "pattern": "Lånevillkoren"}
{"label": "finance", "pattern": "Kontantinsats"}
{"label": "finance", "pattern": "Kontantinsatsen"}
{"label": "finance", "pattern": "Låneansökan"}
{"label": "finance", "pattern": "Lånelöfte"}
{"label": "finance", "pattern": "Lånelöftet"}
{"label": "finance", "pattern": "Bostadslån"}
{"label": "finance", "pattern": "Bostadslånet"}
{"label": "finance", "pattern": "Bostadslånen"}
{"label": "finance", "pattern": "Ränterabatt"}

The issue is still the same.
Moreover, I tried to adopt a strategy proposed at the end of this post:

1. I initialized the annotations_ner_swedish_finance dataset by adding all the terms in the list as “finance”. This is the output of the ner.print-dataset command:

# prodigy ner.print-dataset  annotations_ner_swedish_finance
0.00 	  Villkor  finance 
0.00 	  Villkoren  finance 
0.00 	  Villkorsändring  finance 
0.00 	  Villkorsändringen  finance 
0.00 	  Villkorsändringar  finance 
0.00 	  Lånevillkor  finance 
...
2. I ran ner.batch-train, but without any success, I guess because I don't have enough data to train on (only 150 words):

#prodigy ner.batch-train annotations_ner_swedish_finance my-Model/ --output finance_bootstrap_model

Loaded model Final-Model/
Using 50% of accept/reject examples (71) for evaluation
Using 100% of remaining examples (72) for training
Dropout: 0.2 Batch size: 4 Iterations: 10

BEFORE 0.000
Correct 0
Incorrect 4
Entities 9
Unknown 7

#   LOSS   RIGHT  WRONG  ENTS  SKIP  ACCURACY
01  9.801  0      2      70    0     0.000
02  6.624  0      2      11    0     0.000
03  4.754  0      2      44    0     0.000
04  3.871  0      2      20    0     0.000
05  3.090  0      2      27    0     0.000
06  2.776  0      2      30    0     0.000
07  2.633  0      2      34    0     0.000
08  2.156  0      2      27    0     0.000
09  2.402  1      1      28    0     0.500
10  2.661  1      1      50    0     0.500

Correct 1
Incorrect 1
Baseline 0.000
Accuracy 0.500

Model: /prodigy/data/finance_bootstrap_model
Training data: /prodigy/data/finance_bootstrap_model/training.jsonl
Evaluation data: /prodigy/data/finance_bootstrap_model/evaluation.jsonl

3. I tried to re-run ner.teach, but the results are still the same.

Do you have any suggestions?

Hi! Your first workflow looks good – I think the problem might actually be related to the patterns and the fact that terms.to-patterns assumes that the terms are case-insensitive. For example:

{"label":"finance","pattern":[{"lower":"Villkor"}]}

The above pattern will match all tokens whose lowercase form, i.e. the token.lower_ attribute, equals “Villkor”. This will never be true, because the string in the pattern starts with an uppercase letter. So when you run ner.teach, Prodigy doesn’t find any matches and will just start suggesting random things, since it knows nothing about “finance” yet.
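To make the mismatch concrete, here is a minimal sketch in plain Python. The helper `lower_pattern_matches` is hypothetical and only mimics the comparison described above; it is not spaCy's or Prodigy's actual matcher code.

```python
def lower_pattern_matches(token_text: str, pattern_value: str) -> bool:
    """Mimic a {"lower": value} check: compare the token's lowercase form to value."""
    return token_text.lower() == pattern_value


# The pattern value keeps its capital "V", so no casing of the token can match it.
for text in ["Villkor", "villkor", "VILLKOR"]:
    print(text, lower_pattern_matches(text, "Villkor"))

# With a lowercased pattern value, every casing of the token matches.
for text in ["Villkor", "villkor", "VILLKOR"]:
    print(text, lower_pattern_matches(text, "villkor"))
```

The first loop prints `False` for every token and the second prints `True` for every token, which is why the pattern value itself has to be lowercase.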

I’d suggest the following: Try again with the same workflow, but lowercase your file.txt (or, if you want case-sensitive matches, change "lower" to "orth").
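If regenerating the patterns is inconvenient, the existing file could probably be repaired with a small script along these lines. This is only a sketch: the function name `lowercase_lower_patterns` is made up for illustration, and it assumes the JSONL layout shown in your post (one object per line, with a token-based "pattern" list). Plain string patterns are left untouched.

```python
import json


def lowercase_lower_patterns(jsonl_text: str) -> str:
    """Lowercase every "lower" value in token-based patterns; leave the rest untouched."""
    fixed_lines = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        if isinstance(entry["pattern"], list):
            for token_spec in entry["pattern"]:
                if "lower" in token_spec:
                    token_spec["lower"] = token_spec["lower"].lower()
        # ensure_ascii=False keeps characters such as "ä" readable in the output
        fixed_lines.append(json.dumps(entry, ensure_ascii=False))
    return "\n".join(fixed_lines)


original = '{"label":"finance","pattern":[{"lower":"Villkors\\u00e4ndring"}]}'
print(lowercase_lower_patterns(original))
```

To apply it, read the patterns file as UTF-8 text, pass its contents through the function, and write the result to a new file that you then hand to `--patterns`.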


Thank you for your fast and precise reply!
I have fixed the issue and now the ner.teach behaves as expected.
In order to obtain better results I have done 2 things:

1 - Fixed the patterns file as you suggested, with 2 entries for each word, since I am interested in both the lowercase and the title-cased version of the same word.

2 - Selected a smaller corpus file with more matches. This was the biggest mistake: I had a huge corpus with few matches and expected the model to suggest sentences close to the patterns.
I noticed instead that, especially at the beginning when the model knows nothing about my new class, it is better to use a smaller corpus with sentences as close as possible to my patterns.
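
For anyone in the same situation, here is a rough sketch of how a large corpus could be reduced to the lines that mention at least one term. It is not what Prodigy does internally. The function name `keep_matching_lines` and the sample sentences are invented for illustration, it assumes single-word terms, and it uses a simple regex word split as a stand-in for real tokenization.

```python
import re


def keep_matching_lines(corpus_lines, terms):
    """Return the lines that contain at least one term as a whole word, ignoring case."""
    wanted = {term.lower() for term in terms}
    kept = []
    for line in corpus_lines:
        # \w+ is Unicode-aware in Python 3, so "bostadslån" stays one word
        words = re.findall(r"\w+", line.lower())
        if any(word in wanted for word in words):
            kept.append(line)
    return kept


terms = ["Villkor", "Bostadslån"]
corpus = [
    "Banken ändrade villkor för alla kunder.",
    "Anna träffade Erik i Stockholm.",
    "Hur stort bostadslån kan jag få?",
]
for line in keep_matching_lines(corpus, terms):
    print(line)
```

Only the first and third sentences are kept, so the stream passed to ner.teach starts with text where the patterns can actually fire.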


Yay, glad it works now!

Yes, that sounds reasonable. You definitely want to make sure you start off with enough positive examples so the model can start making meaningful suggestions as early as possible :)