I'm trying to create a new entity label for broadband-related products, and I started by creating a patterns file.
The problem occurs at the end, when I run the new model on new or existing phrases: it identifies every word in the sentence as the label PRODUCT. I'm starting from a completely new, empty model.
Step 1: Create the patterns file and save it as service_pattern.jsonl.
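For reference, here's a minimal sketch of what a Prodigy match-patterns file in JSONL format looks like. The label name and the example terms are my assumptions, not the actual patterns used here:

```python
import json

# Hypothetical broadband-related patterns. Each line of the JSONL file is
# one pattern: a "label" plus either a token-pattern list or a plain string.
patterns = [
    {"label": "PRODUCT", "pattern": [{"lower": "broadband"}]},
    {"label": "PRODUCT", "pattern": [{"lower": "fiber"}, {"lower": "internet"}]},
    {"label": "PRODUCT", "pattern": "ADSL"},
]

with open("service_pattern.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```

Each entry uses the same token-attribute syntax as spaCy's `Matcher`, so you can test a pattern against spaCy directly before handing it to Prodigy.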
Step 2: I import a file with suitable phrases and start the annotation tool. In this case I'm using a file with 50 sentences, but I've also used a file with thousands of rows and get the same result.
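The annotation step would be started with something like the command below. The dataset name, model, and file names are assumptions based on the thread; adjust them to your own setup:

```shell
# Hypothetical invocation: start ner.teach from a base model, feeding it
# the raw phrases and the patterns file to bootstrap suggestions.
prodigy ner.teach ner_product en_core_web_sm service-phrases.txt \
  --label PRODUCT --patterns service_pattern.jsonl
```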
Step 4: Training appears to go well, with an accuracy of 1.0:
Accuracy 1.000
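In Prodigy v1.x this training step would typically be `ner.batch-train`; the exact command and output path below are my guess at what was run:

```shell
# Hypothetical training run: train from the annotations in the dataset
# and save the resulting model to disk with --output.
prodigy ner.batch-train ner_product en_core_web_sm \
  --output /tmp/model-product --label PRODUCT --eval-split 0.2
```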
Problem: The trouble starts when I use the new model as the input model for ner.teach on new phrases.
Prodigy identifies every word inside the phrase as a product.
Even when I use the service-phrases.txt file above, it mislabels the text.
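For context, this is roughly how the trained model would be fed back into ner.teach; the paths and dataset name are assumed:

```shell
# Hypothetical: use the freshly trained model from --output above as the
# base model for a new ner.teach session on unseen phrases.
prodigy ner.teach ner_product_v2 /tmp/model-product new-phrases.txt \
  --label PRODUCT
```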
Yes, an accuracy of 1.0 is always suspicious! Your workflow sounds alright – could you post the full results after training, including all the statistics?
I can't reproduce the scenario with the same 1.0 accuracy, but I started over with a new dataset and a fresh model. The result is the same, just with lower accuracy. I also ran a second batch of training after the first round, but it makes no difference.
I created a new dataset named ner_product and started annotating with the same patterns file.
Do you have any full results and/or examples you can share from when you used the larger dataset with thousands of examples? The problem here is that it's very difficult to draw any conclusions from these results, because you're only training from 24 examples. Even if the result looks similar to what you see when training with thousands of examples, it might occur for completely different reasons.