NER for short unstructured text, what am I doing wrong?

Hello All,

I’m currently dealing with short text observations (up to 140 characters), essentially representing bank transactions. I am trying to extract relevant entities from the text, using spaCy’s NER.

My current workflow is the following:

  1. Create a blank model
  2. Create a pattern file, providing some initial knowledge (limited to exact matches)
  3. Collect the training data using ner.teach (the source data contains 1000 observations/transactions)
  4. Annotate until “No tasks available” (In my case 1500 annotations)
  5. ner.batch-train (using default settings) and produce the output model

After the model has been created, I present it with observations it has already seen and successfully recognized during the teaching phase, and I don’t get good results.

I have defined 2 custom entity types for the model to recognize; let’s call them MERCHANT and TRXTYPE (transaction type).

TRXTYPE is easily recognized since there is a fixed number of Transaction types.
However, the MERCHANT label is rarely assigned, even though the patterns file provides certain names that occur quite often (consider supermarkets, for instance).

Maybe I am doing something wrong; could you please take a look and shed some light on it? :slight_smile:

Thank you in advance!

Hi! Your workflow sounds good :slightly_smiling_face: Could you share some more examples of your patterns and maybe some texts? It’s possible that the answer is that you’re not doing anything wrong per se, but that the label scheme is just difficult to learn given the data.

Is this during annotation or after training when you try the model in spaCy?


Here are the examples, with my annotations (mind the format in which labels are provided):
For this type of transaction, there are numerous examples, with the TRXTYPE label being constant (many debit card transactions) and the MERCHANT label usually differing (FARMACIA DEL CORS is just one of many, many merchants). Essentially, everything after the token ‘IL’ should be labelled as a MERCHANT.

For this type of transaction, there are also numerous examples, with the TRXTYPE label being constant and the MERCHANT label usually differing (SUPERMERCAT IPERAL is just one of many, many merchants). Usually, the part that comes after the merchant is constant (VIA) and the payment service provider varies (in this case VINCENZO PAGAMENTO CONTACTLESS).
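Since the position of the merchant is so regular, the "everything after ‘IL’" observation can even be expressed as a plain rule. A minimal sketch in Python (the helper name is made up; offsets follow Prodigy’s character-based span convention):

```python
def merchant_span_after_il(text):
    """Heuristic: everything after the token 'IL' is the merchant.

    Returns (start, end) character offsets, or None if the marker is absent.
    """
    marker = " IL "
    pos = text.find(marker)
    if pos == -1:
        return None
    return (pos + len(marker), len(text))

text = "CARTA DI DEBITO ADDEBITO ESEGUITO IL FARMACIA DEL CORS"
span = merchant_span_after_il(text)
print(text[span[0]:span[1]])  # FARMACIA DEL CORS
```

A rule like this could be used to pre-label candidates for review, rather than as a replacement for the model.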

The patterns file looks like this (just a few of the examples I provided in order to bootstrap the teaching process and give the model some initial information):
{"label": "TRXTYPE", "pattern": [{"lower": "carta"}, {"lower": "di"}, {"lower": "debito"}]}
{"label": "TRXTYPE", "pattern": [{"lower": "sepa"}, {"lower": "direct"}, {"lower": "debit"}]}
{"label": "TRXTYPE", "pattern": [{"lower": "prel"}, {"lower": "pagam"}, {"lower": "carta"}]}
{"label": "MERCHANT", "pattern": "PIZZA EXPRESS"}
{"label": "MERCHANT", "pattern": "ENERGIA ELETTRICA"}
{"label": "MERCHANT", "pattern": "SUPERMERCAT IPERAL"}

I noticed that during the teaching process, spaCy understands that everything after the ‘IL’ token (which designates where the transaction occurred) should be labelled as the MERCHANT for the first example I provided, and I accept those suggestions.

However, when I load the model produced by the ner.batch-train recipe, it does not recognize the same thing: even when presented with the exact text seen during training, it does not apply the MERCHANT label.

Many thanks!

Hmm, it’s not so easy to see what’s going wrong here. What sort of accuracies does ner.batch-train print?

Using 30% of accept/reject examples (158) for evaluation
Using 100% of remaining examples (371) for training
Dropout: 0.2 Batch size: 16 Iterations: 10

BEFORE 0.059
Correct 6
Incorrect 95
Entities 622
Unknown 599

01 4.722 15 70 117 0 0.176
02 3.990 30 68 149 0 0.306
03 2.234 23 72 86 0 0.242
04 6.625 48 44 238 0 0.522
05 5.278 41 50 256 0 0.451
06 3.041 60 27 139 0 0.690
07 4.043 68 16 158 0 0.810
08 4.396 70 18 181 0 0.795
09 3.073 75 10 152 0 0.882
10 1.723 66 20 132 0 0.767

Correct 75
Incorrect 10
Baseline 0.059
Accuracy 0.882


The training data is small, but it does seem to learn, so I’m puzzled why spaCy wouldn’t continue to predict the entities correctly. Is it possible to email me the dataset?

If not, I can try to debug from a distance a bit more. You might try using the ner.print-best command after training, to more easily inspect the output of the model over the training data. The print-best command uses the annotations you’ve made as constraints, and predicts the best matching parse given the model you read in.

Have a look at the print-best output over the training and development files, which are output into the model directory after ner.batch-train. In theory you should be seeing that the accuracy over the development data is good (the accuracy figures look good, after all). If the output looks bad, then that’s a good clue about what’s going on.

If the output of ner.print-best looks reasonably good but not perfect, a useful step would be to pipe the output into ner.manual, saving the result into a new dataset. During the ner.manual process, you’ll be able to edit the suggestions, so you can get proper gold-standard data, instead of the model’s best guesses.

Once you’ve vetted the output and saved it in a new dataset, when you run ner.batch-train, you’ll be able to use the --no-missing flag. This tells spaCy the annotations it’s learning from are complete and correct, which lets it make stronger assumptions.
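Roughly speaking, the flag changes how unannotated tokens are interpreted. In spaCy’s BILUO tagging scheme, "-" means "missing/unknown", while "O" means "explicitly not part of an entity". A schematic illustration (the tags here are just for exposition, not real training code):

```python
tokens = ["CARTA", "DI", "DEBITO", "ADDEBITO", "ESEGUITO", "IL", "PELUQUEVIAS"]

# Default: tokens outside annotated spans are treated as missing ("-"),
# so the model gets no signal about them either way.
missing = ["-"] * len(tokens)
# With --no-missing: the same tokens become explicit "O" (not an entity),
# which is a much stronger training signal.
no_missing = ["O"] * len(tokens)

# One annotated span: the first three tokens are TRXTYPE.
for tags in (missing, no_missing):
    tags[0:3] = ["B-TRXTYPE", "I-TRXTYPE", "L-TRXTYPE"]

print(missing)     # unannotated tokens stay "-"
print(no_missing)  # unannotated tokens become "O"
```

This is why the annotations need to be truly complete before the flag is safe to use: with --no-missing, any entity you failed to mark is treated as a confirmed non-entity.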

Hi! Unfortunately, I’m unable to share the data, even though I have provided some examples above.

However, the ner.print-best command’s output does not look very good. The majority of the predictions for the most frequent observations (let’s call them ‘CARTA DI DEBITO’ bank transactions) do not contain the labelled MERCHANT part.

Does that mean I have to provide more patterns through the initial patterns file? During the ner.teach process, spaCy did not ask many questions about the MERCHANT label for that kind of observation (relatively speaking, given that more than 25% of the source data consists of these observations).

Thank you, Matthew!

Okay, I think it’s best to fix up the entities manually, so that you can move forward with a simpler situation. This also makes sure you’re producing annotations you can export to another tool, or later use as evaluation data.

Above I suggested you pipe ner.print-best into ner.manual, but we also have a single recipe for this in the recipes repo: . You would use it with a command like this:

prodigy ner.silver-to-gold <previous dataset> <new dataset> <spacy model> -F prodigy-recipes/ner/

I think it’s probably better to clone the repo and use the file, because then it’ll be easy for you to edit it if you want to make customisations. For instance, you might find you want to use your patterns to make automated corrections. You can do that by writing a new function that modifies the stream. For instance, let’s say you’re getting a lot of entities that are four words or more, and you know that’s impossible in your dataset, but for whatever reason the model hasn’t learned that. You could write a function like this:

def filter_long_entities(stream):
    '''Modify a stream, removing long (>= 4 words) entities.'''
    for eg in stream:
        new_entities = []
        for entity in eg['spans']:
            # Keep only spans shorter than 4 words
            if len(entity['text'].split()) < 4:
                new_entities.append(entity)
        eg['spans'] = new_entities
        yield eg

And then inside the recipe, write something like stream = filter_long_entities(stream)


Okay, I have used the ner.silver-to-gold recipe and corrected all the annotations that needed corrections (mainly annotating the parts with the MERCHANT label), until “No tasks available”.

Then, I used the ner.batch-train recipe with the newly created dataset from the previous step, passing the --no-missing flag so the model can make stronger assumptions.

This is the output of the ner.batch-train:
Using 50% of accept/reject examples (268) for evaluation
Using 100% of remaining examples (268) for training
Dropout: 0.2 Batch size: 13 Iterations: 10

BEFORE 0.542
Correct 230
Incorrect 194
Entities 288
Unknown 0

01 1.614 299 173 405 0 0.633
02 0.821 306 144 390 0 0.680
03 0.686 320 112 386 0 0.741
04 0.530 305 138 382 0 0.688
05 0.455 320 118 392 0 0.731
06 0.437 320 111 385 0 0.742
07 0.341 318 112 382 0 0.740
08 0.191 318 108 378 0 0.746
09 0.284 315 114 378 0 0.734
10 0.278 318 111 381 0 0.741

Correct 318
Incorrect 108
Baseline 0.542
Accuracy 0.746

Checking the ner.print-best output, I notice that the results are not very good: the majority of actual MERCHANT labels are missing.

For the ones that were recognized as MERCHANT and properly labelled, I can see a low confidence score, e.g.:

{"text": "CARTA DI DEBITO ADDEBITO ESEGUITO IL PELUQUEVIAS", "spans": [{"start": 0, "end": 15, "text": "CARTA DI DEBITO", "rank": 3, "label": "TRXTYPE", "score": 0.7036750537584333}, {"start": 37, "end": 48, "text": "PELUQUEVIAS", "rank": 3, "label": "MERCHANT", "score": 0.005171736844078305}], "_input_hash": -948486369, "_task_hash": -1565393250}

However, when I load the model and present it with the same example, it only assigns the TRXTYPE label.

Please let me know if I have omitted something.

Thanks in advance!

I think overall this looks reasonable for the size of the dataset. You might want to try training again a few times with different hyper-parameters, to see if you can get a better result. When the dataset is very small, it’s good to try a batch size of 2. You might also try training for more iterations, and increasing the dropout to 0.5.

Run it a few times with different settings and see whether you get better results. I would then suggest doing more annotation, and then going through the silver-to-gold process again to check that things are improving.
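To make the re-runs systematic, you could generate the commands from a small grid of settings (the values below are just examples, and the dataset/model names are placeholders):

```python
from itertools import product

# Candidate hyper-parameter values to sweep over
batch_sizes = [2, 4, 8]
dropouts = [0.2, 0.5]
n_iters = [10, 15]

commands = [
    f"prodigy ner.batch-train dataset model --no-missing "
    f"--n-iter {n} --batch-size {b} --dropout {d}"
    for b, d, n in product(batch_sizes, dropouts, n_iters)
]
for cmd in commands:
    print(cmd)
```

Each command writes out a model, so remember to use a different output directory per run if you want to compare them afterwards.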

Hi! I did as you have suggested.

I have done more annotation using the silver-to-gold recipe and saved the annotations to a new dataset.

Then, I tried to train the model again, but it does not improve on the baseline accuracy.

Here is the output:

Using 50% of accept/reject examples (268) for evaluation
Using 100% of remaining examples (268) for training
Dropout: 0.5 Batch size: 2 Iterations: 15

BEFORE 0.833
Correct 348
Incorrect 70
Entities 380
Unknown 0

01 27.936 343 94 394 0 0.785
02 25.019 341 80 376 0 0.810
03 26.837 338 96 386 0 0.779
04 27.133 343 85 385 0 0.801
05 18.463 339 92 384 0 0.787
06 18.678 348 78 388 0 0.817
07 13.618 348 79 389 0 0.815
08 11.538 342 89 387 0 0.794
09 14.711 339 94 386 0 0.783
10 12.308 339 94 386 0 0.783
11 12.671 345 78 382 0 0.816
12 11.958 346 80 386 0 0.812
13 11.065 342 93 391 0 0.786
14 9.631 345 80 384 0 0.812
15 10.387 338 98 388 0 0.775

Correct 348
Incorrect 78
Baseline 0.833
Accuracy 0.817

Could you paste the command you ran? I think you might be missing the --no-missing argument, but it’s hard to tell. The dataset is also quite small still, although I’d think it would be able to learn from here. It does depend on how consistent the annotations are, though.

Hi! Sure.

$ prodigy ner.batch-train dataset model --no-missing --n-iter 15 --batch-size 2 --dropout 0.5