No entities found when running ner.batch-train on new NER

I’m trying to create a custom NER model for Swedish street addresses, with the label STREET.
I’ve added a patterns file to cover a street name followed by a number:

{"label": "STREET", "pattern": [{"is_alpha": true},{"is_digit": true}]}

I have also generated a txt file that contains 60k phrases that include the pattern together with some surrounding words, so I get some context.
Sample phrases, translated to English:

 I want to book a cab to Streetname 1
 Can you send me a car to Streetname 2
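
Purely for illustration, a file like that could be generated with a small template script; the templates and street names below are made up:

    import random

    templates = [
        "Jag vill boka en taxi till {}",  # "I want to book a cab to {}"
        "Kan du skicka en bil till {}",   # "Can you send me a car to {}"
    ]
    streets = ["Storgatan", "Kungsgatan", "Drottninggatan"]

    with open("streetphrases.txt", "w", encoding="utf-8") as f:
        for _ in range(60000):
            address = f"{random.choice(streets)} {random.randint(1, 99)}"
            f.write(random.choice(templates).format(address) + "\n")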

I ran the following command and annotated around 2k examples:

 prodigy ner.teach street swedish-model streetphrases.txt --label STREET --patterns patterns.jsonl

After that I ran ner.batch-train with the following command and got this result:

prodigy ner.batch-train street new-swedish-model --label STREET --output new_model

Using 1 labels: STREET

Loaded model new-swedish-model
Using 20% of accept/reject examples (282) for evaluation
Using 100% of remaining examples (1132) for training
Dropout: 0.2  Batch size: 16  Iterations: 10  


BEFORE      0.000           
Correct     0 
Incorrect   2
Entities    0               
Unknown     0               

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY
01           2480.805     0            2            1180         0            0.000
02           1165.106     0            2            1175         0            0.000
03           989.513      0            2            1173         0            0.000
04           900.055      1            1            1171         0            0.500
05           846.116      1            1            1170         0            0.500
06           821.424      1            1            1173         0            0.500
07           884.305      1            1            1173         0            0.500
08           848.783      2            0            1172         0            1.000
09           825.181      1            1            1171         0            0.500
10           787.083      2            0            1170         0            1.000

Correct     2 
Incorrect   0
Baseline    0.000           
Accuracy    1.000  

What am I doing wrong here?
I thought there was an issue in the dataset, so I printed it just to see.

  0.97 	A car to Central Street 51  STREET
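
For reference, a dataset like this can be printed with Prodigy’s database API (a minimal sketch; the field names assume the defaults ner.teach writes):

    from prodigy.components.db import connect

    db = connect()  # uses the database settings from your prodigy.json
    for eg in db.get_dataset("street"):
        if eg["answer"] == "accept":
            score = eg.get("meta", {}).get("score")
            for span in eg.get("spans", []):
                print(score, eg["text"], span["label"])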

That does look suspicious. How many did you accept?

What would probably be most helpful for you is a recipe where you can use the patterns file and then use the ner_manual interface. This is a small gap in the recipe suite at the moment, but it can be easily assembled with the current pieces. Ines gives a code example here: How to use a spaCy pattern in Prodigy
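
Roughly, such a recipe could look like the following. This is an untested sketch combining the TXT loader, the PatternMatcher and the ner_manual interface; the recipe name is a placeholder, and Ines’s post linked above is the authoritative version:

    import spacy
    import prodigy
    from prodigy.components.loaders import TXT
    from prodigy.components.preprocess import add_tokens
    from prodigy.models.matcher import PatternMatcher

    @prodigy.recipe("ner.manual-patterns")
    def ner_manual_patterns(dataset, spacy_model, source, patterns):
        nlp = spacy.load(spacy_model)
        # Pre-highlight pattern matches so they can be corrected manually
        matcher = PatternMatcher(nlp).from_disk(patterns)
        stream = TXT(source)
        stream = (eg for score, eg in matcher(stream))
        stream = add_tokens(nlp, stream)  # ner_manual needs token data
        return {
            "dataset": dataset,
            "view_id": "ner_manual",
            "stream": stream,
            "config": {"labels": ["STREET"]},
        }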

I do wonder why ner.teach didn’t seem to work here, though, as what you did seems okay to me. My first thought is that maybe there aren’t enough accepts, but if there are plenty of accept cases, I’d like to make sure there isn’t something wrong with the recipe.

I’ll try the manual way to see if it differs.

Since I did use a good patterns file, I got plenty of accepts. I’d guess around 80% are accepts, so there are over 1k.

The majority of addresses contain special characters from the Swedish alphabet like åäö, but I don’t think that matters, right?

Would it be possible for you to email me the data file? I understand it might be sensitive so this might not be possible, but if you can send it I’ll take a look.

Sure! I’ll do a db-out of the dataset and send it to you right away.
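
For reference, the export looks like this (the output file name is arbitrary):

    prodigy db-out street > street_annotations.jsonl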

@honnibal - I’m curious if you had the time to look at the files I sent you?

Thanks for your patience on this, and thanks for sending across the data files.

There was indeed a bug here, triggered by an unusual aspect of your data. Almost all of your entities end on the last character of the text. In ner.batch-train, the default is to apply a sentence segmentation process (this is actually not a great default, but anyway…).

When applying the sentence segmentation, we need to adjust the entity annotations so that the offsets are correct with respect to the new objects. There was an off-by-one error in this calculation. Bug and patch:


- if start >= sent.start_char and end < sent.end_char:
+ if start >= sent.start_char and end <= sent.end_char:

This off-by-one error caused entities that ended exactly on the end of the text to be dropped. In most cases, there will be a text-final period or other punctuation, so this off-by-one error isn’t usually triggered – but on your data it caused almost all the entities to be dropped!
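
Here’s a toy illustration of the condition, using the example text from earlier in the thread (offsets computed by hand):

    # "A car to Central Street 51" with the entity "Central Street 51"
    text = "A car to Central Street 51"
    start, end = 9, 26                    # character offsets of the entity
    sent_start, sent_end = 0, len(text)   # one sentence spanning the whole text

    buggy = start >= sent_start and end < sent_end    # False -> entity dropped
    fixed = start >= sent_start and end <= sent_end   # True  -> entity kept
    print(buggy, fixed)  # False True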

You can work around the error right away by passing the --unsegmented argument to `ner.batch-train`. When I run:

python -m prodigy ner.batch-train merik1 blank:sv --label STREET --no-missing --unsegmented

I’m getting:

Using 1 labels: STREET

Loaded model sv
Using 20% of accept/reject examples (282) for evaluation
Using 100% of remaining examples (1132) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE      0.000
Correct     0
Incorrect   269
Entities    0
Unknown     0

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY
01           705.941      267          6            271          0            0.978
02           17.706       264          5            264          0            0.981
03           17.273       266          5            268          0            0.982
04           8.809        266          4            267          0            0.985
05           5.087        266          3            266          0            0.989
06           9.109        266          5            268          0            0.982
07           11.586       266          6            269          0            0.978
08           7.822        266          5            268          0            0.982
09           7.467        266          5            268          0            0.982
10           2.366        266          6            269          0            0.978

Correct     266
Incorrect   3
Baseline    0.000
Accuracy    0.989

So, at least that’s some good news — it looks like the model learns your annotations very well!

I’ve also fixed the bug in Prodigy, and the fix should be released shortly.

Just released v1.8.3 that should fix this – could you try again with the new version? :slightly_smiling_face: