No entities found when running ner.batch-train on new NER

I’m trying to create a custom NER model for Swedish street addresses, with the label STREET.
I’ve added a patterns file to cover a street name followed by a number:

{"label": "STREET", "pattern": [{"is_alpha": true},{"is_digit": true}]}

I have also generated a txt file that contains 60k phrases that include the pattern together with some surrounding words, so I get some context.
Sample phrases, translated to English:

 I want to book a cab to Streetname 1
 Can you send me a car to Streetname 2
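
Purely for illustration, a file like that could be generated with a small template script; the templates and street names below are made up:

    import random

    templates = [
        "Jag vill boka en taxi till {}",  # "I want to book a cab to {}"
        "Kan du skicka en bil till {}",   # "Can you send me a car to {}"
    ]
    streets = ["Storgatan", "Kungsgatan", "Drottninggatan"]

    with open("streetphrases.txt", "w", encoding="utf-8") as f:
        for _ in range(60000):
            address = f"{random.choice(streets)} {random.randint(1, 99)}"
            f.write(random.choice(templates).format(address) + "\n")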

I ran the following command and annotated around 2k examples:

 prodigy ner.teach street swedish-model streetphrases.txt --label STREET --patterns patterns.jsonl

After that I ran ner.batch-train with the following command and got this result:

prodigy ner.batch-train street new-swedish-model --label STREET --output new_model

Using 1 labels: STREET

Loaded model new-swedish-model
Using 20% of accept/reject examples (282) for evaluation
Using 100% of remaining examples (1132) for training
Dropout: 0.2  Batch size: 16  Iterations: 10  


BEFORE      0.000           
Correct     0 
Incorrect   2
Entities    0               
Unknown     0               

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY
01           2480.805     0            2            1180         0            0.000
02           1165.106     0            2            1175         0            0.000
03           989.513      0            2            1173         0            0.000
04           900.055      1            1            1171         0            0.500
05           846.116      1            1            1170         0            0.500
06           821.424      1            1            1173         0            0.500
07           884.305      1            1            1173         0            0.500
08           848.783      2            0            1172         0            1.000
09           825.181      1            1            1171         0            0.500
10           787.083      2            0            1170         0            1.000

Correct     2 
Incorrect   0
Baseline    0.000           
Accuracy    1.000  

What am I doing wrong here?
I thought there was an issue in the dataset, so I printed it just to see.

  0.97 	A car to Central Street 51  STREET
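
For reference, a dataset like this can be printed with Prodigy’s database API (a minimal sketch; the field names assume the defaults ner.teach writes):

    from prodigy.components.db import connect

    db = connect()  # uses the database settings from your prodigy.json
    for eg in db.get_dataset("street"):
        if eg["answer"] == "accept":
            score = eg.get("meta", {}).get("score")
            for span in eg.get("spans", []):
                print(score, eg["text"], span["label"])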

That does look suspicious. How many did you accept?

What would probably be most helpful for you is a recipe where you can use the patterns file and then use the ner_manual interface. This is a small gap in the recipe suite at the moment, but it can be easily assembled with the current pieces. Ines gives a code example here: How to use a spaCy pattern in Prodigy
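
Roughly, such a recipe could look like the following. This is an untested sketch combining the TXT loader, the PatternMatcher and the ner_manual interface; the recipe name is a placeholder, and Ines’s post linked above is the authoritative version:

    import spacy
    import prodigy
    from prodigy.components.loaders import TXT
    from prodigy.components.preprocess import add_tokens
    from prodigy.models.matcher import PatternMatcher

    @prodigy.recipe("ner.manual-patterns")
    def ner_manual_patterns(dataset, spacy_model, source, patterns):
        nlp = spacy.load(spacy_model)
        # Pre-highlight pattern matches so they can be corrected manually
        matcher = PatternMatcher(nlp).from_disk(patterns)
        stream = TXT(source)
        stream = (eg for score, eg in matcher(stream))
        stream = add_tokens(nlp, stream)  # ner_manual needs token data
        return {
            "dataset": dataset,
            "view_id": "ner_manual",
            "stream": stream,
            "config": {"labels": ["STREET"]},
        }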

I do wonder why ner.teach didn’t seem to work here, though, as what you did seems okay to me. My first thought is that maybe there aren’t enough accepts, but if there are plenty of accept cases, I’d like to make sure there isn’t something wrong with the recipe.

I’ll try the manual way to see if it differs.

Since I did use a good patterns file, I got plenty of accepts. I’d guess around 80% are accepts, so there are over 1k.

The majority of addresses contain special characters from the Swedish alphabet like åäö, but I don’t think that matters, right?

Would it be possible for you to email me the data file? I understand it might be sensitive so this might not be possible, but if you can send it I’ll take a look.

Sure! I’ll do a db-out of the dataset and send it to you right away.
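
For reference, the export looks like this (the output file name is arbitrary):

    prodigy db-out street > street_annotations.jsonl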

@honnibal - I’m curious if you had the time to look at the files I sent you?

Thanks for your patience on this, and thanks for sending across the data files.

There was indeed a bug here, triggered by an unusual aspect of your data. Almost all of your entities end on the last character of the text. In ner.batch-train, the default is to apply a sentence segmentation process (this is actually not a great default, but anyway…).

When applying the sentence segmentation, we need to adjust the entity annotations so that the offsets are correct with respect to the new objects. There was an off-by-one error in this calculation. Bug and patch:


- if start >= sent.start_char and end < sent.end_char:
+ if start >= sent.start_char and end <= sent.end_char:

This off-by-one error caused entities that ended exactly on the end of the text to be dropped. In most cases, there will be a text-final period or other punctuation, so this off-by-one error isn’t usually triggered – but on your data it caused almost all the entities to be dropped!
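
Here’s a toy illustration of the condition, using the example text from earlier in the thread (offsets computed by hand):

    # "A car to Central Street 51" with the entity "Central Street 51"
    text = "A car to Central Street 51"
    start, end = 9, 26                    # character offsets of the entity
    sent_start, sent_end = 0, len(text)   # one sentence spanning the whole text

    buggy = start >= sent_start and end < sent_end    # False -> entity dropped
    fixed = start >= sent_start and end <= sent_end   # True  -> entity kept
    print(buggy, fixed)  # False True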

You can work around the error right away by passing the --unsegmented argument to `ner.batch-train`. When I run:

python -m prodigy ner.batch-train merik1 blank:sv --label STREET --no-missing --unsegmented

I’m getting:

Using 1 labels: STREET

Loaded model sv
Using 20% of accept/reject examples (282) for evaluation
Using 100% of remaining examples (1132) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE      0.000
Correct     0
Incorrect   269
Entities    0
Unknown     0

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY
01           705.941      267          6            271          0            0.978
02           17.706       264          5            264          0            0.981
03           17.273       266          5            268          0            0.982
04           8.809        266          4            267          0            0.985
05           5.087        266          3            266          0            0.989
06           9.109        266          5            268          0            0.982
07           11.586       266          6            269          0            0.978
08           7.822        266          5            268          0            0.982
09           7.467        266          5            268          0            0.982
10           2.366        266          6            269          0            0.978

Correct     266
Incorrect   3
Baseline    0.000
Accuracy    0.989

So, at least that’s some good news — it looks like the model learns your annotations very well!

I’ve also fixed the bug in Prodigy, and the fix should be released shortly.

Just released v1.8.3 that should fix this – could you try again with the new version? :slightly_smiling_face: