Training a NER model from scratch using (forward-looking) patterns

Sorry for the many questions - I hope I'm not wasting too much of your time. I couldn't agree more with this user's comment from the other day.

For the sake of simplicity, let's say I want to make an entity SEK_AMOUNT, e.g. capturing 10 from the expression SEK 10. I'd like to teach a NER model to do this - it's just a toy example.

  1. Using patterns I can easily capture SEK 10 with only a few false positives, whereas capturing just 10 on its own isn't possible without getting a lot of false positives (see the sketch after this list). How would you propose to proceed?

  2. In my real-world case I have a custom component that uses the EntityRuler. It creates entities, but with false positives. Still, it's a good starting point for collecting entities from scratch, using the existing entities plus some logic around them - but the entities from my component should NOT be saved as entities. Should I write my own recipe for this, probably something close to ner.match?
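
For reference, the kind of pattern I have in mind for question 1 is roughly this (a minimal sketch using spaCy v2's EntityRuler; SEK_AMOUNT is just my made-up label):

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.blank("en")
ruler = EntityRuler(nlp)
# "SEK" followed by a number-like token, e.g. "SEK 10"
ruler.add_patterns([
    {"label": "SEK_AMOUNT", "pattern": [{"TEXT": "SEK"}, {"LIKE_NUM": True}]}
])
nlp.add_pipe(ruler)

doc = nlp("The total came to SEK 10.")
print([(ent.text, ent.label_) for ent in doc.ents])  # [('SEK 10', 'SEK_AMOUNT')]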

Thank you.

Off-topic: when do you announce spaCy IRL for 2020? Hopefully you'll continue this year's great success!

I think a custom recipe would work well for your problem. You could just recognise SEK 10 and then trim the entity down with a rule afterwards. Alternatively, you could have the model recognise the whole phrase and then only use the numeric part in your application?
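
For example, a post-processing rule along these lines might do the trimming (just a rough sketch, assuming the entity label is SEK_AMOUNT and the amount is the number-like token inside the match):

from spacy.tokens import Span

def trim_to_amount(doc):
    # Shrink each SEK_AMOUNT entity to its number-like tokens,
    # e.g. "SEK 10" -> "10".
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "SEK_AMOUNT":
            numbers = [t for t in ent if t.like_num]
            if numbers:
                ent = Span(doc, numbers[0].i, numbers[-1].i + 1, label=ent.label)
        new_ents.append(ent)
    doc.ents = new_ents
    return doc

# assuming the ruler component is called "entity_ruler":
# nlp.add_pipe(trim_to_amount, after="entity_ruler")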

Using the patterns to train a model, probably with a custom recipe so that you have better control and can customise things, seems like a good approach.

Great, thanks!

Does this seem right to you?

from prodigy.core import recipe, recipe_args
from prodigy.components.db import connect
from prodigy.components.loaders import get_stream
from prodigy.util import log

# `nlp` (my pipeline with the custom component) and the `Entity` label
# constants are defined elsewhere in the project.


def add_metric_amount(stream):
    for task in stream:
        # one span per entity created by the custom component
        spans = [
            {"label": "M_AMT", "start": e.start_char, "end": e.end_char}
            for e in nlp(task["text"]).ents
            if e.label_ in (Entity.AmountRange.label, Entity.Amount.label)
        ]
        log(f"Seeing {len(spans)} spans")
        # yield one task per matched span
        for span in spans:
            task["spans"] = [span]
            yield task


@recipe(
    "ner.custom-match",
    dataset=recipe_args["dataset"],
    source=recipe_args["source"],
    api=recipe_args["api"],
    loader=recipe_args["loader"],
    exclude=recipe_args["exclude"],
    resume=(
        "Resume from existing dataset and update matcher accordingly",
        "flag",
        "R",
        bool,
    ),
)
def custom_match(
    dataset, source=None, api=None, loader=None, exclude=None, resume=False,
):
    log("RECIPE: Starting recipe ner.custom-match", locals())
    DB = connect()  # not used yet (resume isn't implemented in this draft)

    stream = get_stream(
        source, api=api, loader=loader, rehash=True, dedup=True, input_key="text"
    )
    return {
        "view_id": "ner",
        "dataset": dataset,
        "stream": add_metric_amount(stream),
        "exclude": exclude,
    }

It works fine, BUT it seems that it only yields a task for the first matched span and not a task for each matched span. And I'm running out of tasks, although there should be tens of thousands.
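
For what it's worth, I wonder if the problem is that I keep mutating and yielding the same task dict, so every variant ends up with the same task hash. A rough sketch of what I mean, assuming prodigy.set_hashes is the right helper for re-hashing:

import copy
from prodigy import set_hashes

def add_metric_amount(stream):
    for task in stream:
        # nlp / Entity as in the recipe above
        spans = [
            {"label": "M_AMT", "start": e.start_char, "end": e.end_char}
            for e in nlp(task["text"]).ents
            if e.label_ in (Entity.AmountRange.label, Entity.Amount.label)
        ]
        for span in spans:
            eg = copy.deepcopy(task)  # don't reuse the same dict for every span
            eg["spans"] = [span]
            # re-hash so each span variant gets its own task hash
            yield set_hashes(eg, overwrite=True)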

I did around 200 annotations and got a model with 40% accuracy (I just wanted to sanity-check the model) using en_vectors_web_lg. Then I tried ner.teach with the new model, but it suggested almost every token as an entity, which puzzles me. I then rejected a whole lot, so I have 265 accepted and 1176 rejected annotations in total. Now, when I try to run ner.batch-train again, I get the following error:

ValueError: [E103] Trying to set conflicting doc.ents: '(166, 167, '!M_AMT')' and '(154, 167, '!M_AMT')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
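
If I read the error correctly, it just means spaCy is being asked to set two overlapping entity spans on the same tokens, which a minimal example reproduces (sketch with spaCy v2.2):

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("They paid SEK 10 for it")
# two spans that overlap on the "10" token
doc.ents = [
    Span(doc, 2, 4, label="M_AMT"),  # "SEK 10"
    Span(doc, 3, 4, label="M_AMT"),  # "10"
]  # raises ValueError: [E103] Trying to set conflicting doc.ents ...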

I'm guessing it has to do with the binary annotation tasks (one task per matched span instead of one task per document)? The initial batch-train output puzzles me as well:

Loaded model en_vectors_web_lg
Using 50% of accept/reject examples (210) for evaluation
Using 100% of remaining examples (305) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  


BEFORE      0.000            
Correct     0  
Incorrect   24
Entities    0                
Unknown     0   

Those numbers do not correspond to the dataset

Dataset       ner-fresh          
Created       2019-12-12 14:31:12
Description   None               
Author        None               
Annotations   1441               
Accept        265                
Reject        1176               
Ignore        0  

Are you using spaCy v2.2? Its handling of binary annotations is currently the only incompatibility with the existing version of Prodigy – we'll be resolving that in the upcoming Prodigy v1.9.

Aha! Yes, I'm on v2.2.3.

So I should simply downgrade and then run ner.batch-train for now? My annotations are fine, right?

Thanks @ines. I've downgraded to spaCy v2.1.9 and now I get the following outputs:

❯ prodigy stats ner-fresh

  ✨  Prodigy stats

Version          1.8.5                         
Location         /Users/nixd-mac/PycharmProjects/venvs/outlook/lib/python3.7/site-packages/prodigy
Prodigy Home     /Users/nixd-mac/.prodigy      
Platform         Darwin-19.0.0-x86_64-i386-64bit
Python Version   3.7.5                         
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   13                            
Total Sessions   96                            


  ✨  Dataset 'ner-fresh'

Dataset       ner-fresh          
Created       2019-12-12 14:31:12
Description   None               
Author        None               
Annotations   1441               
Accept        265                
Reject        1176               
Ignore        0 

and

❯ prodigy ner.batch-train ner-fresh en_vectors_web_lg --output test-model

Loaded model en_vectors_web_lg
Using 50% of accept/reject examples (210) for evaluation
Using 100% of remaining examples (305) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  


BEFORE      0.000            
Correct     0  
Incorrect   24
Entities    0                
Unknown     0                

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
01           3927.988     0            24           0            0            0.000                                                                                                                                                                               
02           3055.203     3            422          2387         0            0.007                                                                                                                                                                               
03           2406.244     6            450          2449         0            0.013                                                                                                                                                                               
04           1478.541     13           435          2455         0            0.029                                                                                                                                                                               
05           1011.577     12           403          2413         0            0.029                                                                                                                                                                               
06           832.349      13           409          2390         0            0.031                                                                                                                                                                               
07           720.200      13           398          2517         0            0.032                                                                                                                                                                               
08           508.446      14           386          2327         0            0.035                                                                                                                                                                               
09           388.819      12           404          2380         0            0.029                                                                                                                                                                               
10           318.288      12           402          2312         0            0.029                                                                                                                                                                               

Correct     14  
Incorrect   386
Baseline    0.000             
Accuracy    0.035             


Model: /Users/nixd-mac/PycharmProjects/outlook/test-model
Training data: /Users/nixd-mac/PycharmProjects/outlook/test-model/training.jsonl
Evaluation data: /Users/nixd-mac/PycharmProjects/outlook/test-model/evaluation.jsonl

To me there's a mismatch in those outputs: the numbers just don't add up. And the accuracy is incredibly bad due to false positives (i.e. almost every token is marked as an entity). Any thoughts?

The binary annotations work best for improving an existing model. If you're starting from scratch, the model often struggles to refine the definition of the task, given the weak supervision. So that might be what's happening here. You could try using the --no-missing flag, which declares that any entities that aren't annotated are incorrect. If your annotations don't have many missing entities, this would probably work quite well.

If your annotations do have a lot of missing entities, you could try the ner.silver-to-gold recipe, here: https://github.com/explosion/prodigy-recipes/blob/master/ner/ner_silver_to_gold.py

That's exactly what I needed:

❯ prodigy ner.batch-train ner-fresh en_vectors_web_lg --output test-model --no-missing

Loaded model en_vectors_web_lg
Using 50% of accept/reject examples (213) for evaluation
Using 100% of remaining examples (314) for training
Dropout: 0.2  Batch size: 10  Iterations: 10  


BEFORE      0.000            
Correct     0  
Incorrect   28
Entities    0                
Unknown     0                

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
01           758.245      0            29           1            0            0.000                                                                                                                                                                               
02           56.343       2            26           2            0            0.071                                                                                                                                                                               
03           41.102       10           23           15           0            0.303                                                                                                                                                                               
04           36.703       12           22           18           0            0.353                                                                                                                                                                               
05           15.407       10           25           17           0            0.286                                                                                                                                                                               
06           24.160       9            24           14           0            0.273                                                                                                                                                                               
07           14.749       16           24           28           0            0.400                                                                                                                                                                               
08           15.339       14           26           26           0            0.350                                                                                                                                                                               
09           11.776       14           29           29           0            0.326                                                                                                                                                                               
10           9.780        15           25           27           0            0.375                                                                                                                                                                               

Correct     16 
Incorrect   24
Baseline    0.000            
Accuracy    0.400            


Model: /Users/nixd-mac/PycharmProjects/outlook/test-model
Training data: /Users/nixd-mac/PycharmProjects/outlook/test-model/training.jsonl
Evaluation data: /Users/nixd-mac/PycharmProjects/outlook/test-model/evaluation.jsonl

I'm still puzzled by the number of examples used for training and evaluation compared to the number of examples in the dataset, though?