accuracy not improving much with ner.batch-train

I created 24908 documents with labels using EntityRuler, PhraseMatcher. Then i used ner.batch-train with below options "--n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented --no-missing"

The accuracy is not improving much in the last iterations. Do I need to run more iterations or add more data? I would think a dataset of 24,000 examples is plenty for batch-train. May I know why the accuracy is not improving?

15:05:54 - MODEL: Using 24908 examples (without 'ignore')
Using 20% of accept/reject examples (4929) for evaluation
15:05:59 - RECIPE: Temporarily disabled other pipes: ['tagger', 'parser']
15:05:59 - RECIPE: Initialised EntityRecognizer with model en_core_web_sm
15:05:59 - MODEL: Merging entity spans of 4929 examples
15:05:59 - MODEL: Using 4929 examples (without 'ignore')
15:08:50 - MODEL: Evaluated 4929 examples
15:08:50 - RECIPE: Calculated baseline from evaluation examples (accuracy 0.00)
Using 100% of remaining examples (19719) for training
Dropout: 0.2 Batch size: 16 Iterations: 10

BEFORE 0.000
Correct 0
Incorrect 48658
Entities 189068
Unknown 0

# LOSS RIGHT WRONG ENTS SKIP ACCURACY

15:42:42 - MODEL: Merging entity spans of 4929 examples
15:42:42 - MODEL: Using 4929 examples (without 'ignore')
15:45:12 - MODEL: Evaluated 4929 examples
01 5095298.645 15456 44599 26853 0 0.257
16:24:57 - MODEL: Using 4929 examples (without 'ignore')
16:27:45 - MODEL: Evaluated 4929 examples
02 5158556.137 29951 29299 40757 0 0.506
17:29:12 - MODEL: Merging entity spans of 4929 examples
17:29:13 - MODEL: Using 4929 examples (without 'ignore')
17:31:11 - MODEL: Evaluated 4929 examples
03 5091354.893 34732 23599 44824 0 0.595
18:06:08 - MODEL: Merging entity spans of 4929 examples
18:06:08 - MODEL: Using 4929 examples (without 'ignore')
18:08:04 - MODEL: Evaluated 4929 examples
04 5040571.233 36622 21549 46621 0 0.630
18:40:29 - MODEL: Merging entity spans of 4929 examples
18:40:29 - MODEL: Using 4929 examples (without 'ignore')
18:42:29 - MODEL: Evaluated 4929 examples
05 4933042.997 37395 20333 46988 0 0.648
22:00:31 - MODEL: Merging entity spans of 4929 examples
22:00:32 - MODEL: Using 4929 examples (without 'ignore')
22:03:09 - MODEL: Evaluated 4929 examples
06 4978476.187 37843 19299 46856 0 0.662
22:40:34 - MODEL: Merging entity spans of 4929 examples
22:40:34 - MODEL: Using 4929 examples (without 'ignore')
22:42:39 - MODEL: Evaluated 4929 examples
07 5028080.368 38169 18570 46800 0 0.673
23:09:27 - MODEL: Merging entity spans of 4929 examples
23:09:28 - MODEL: Using 4929 examples (without 'ignore')
23:11:25 - MODEL: Evaluated 4929 examples
08 4943048.409 38547 18011 47009 0 0.682
06:59:55 - MODEL: Merging entity spans of 4929 examples
06:59:56 - MODEL: Using 4929 examples (without 'ignore')
07:01:51 - MODEL: Evaluated 4929 examples
09 4869467.426 38833 17512 47081 0 0.689
07:34:13 - MODEL: Merging entity spans of 4929 examples
07:34:14 - MODEL: Using 4929 examples (without 'ignore')
07:36:27 - MODEL: Evaluated 4929 examples
10 4931146.311 39036 17146 47122 0 0.695

Correct 39036
Incorrect 17146
Baseline 0.000
Accuracy 0.695

07:36:27 - RECIPE: Restoring disabled pipes: ['tagger', 'parser']

I think your dataset size is fine, so the settings are the first thing I'd look at.

Try using en_vectors_web_lg instead of en_core_web_sm: with en_core_web_sm, I think the model is trying to train the new entity recognizer on top of the existing entities, while en_vectors_web_lg starts you off with word vectors and a blank model, which should work better here.
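
For example, something like this (the dataset name and output path here are placeholders; the other settings are the ones you mentioned using):

python -m prodigy ner.batch-train your_dataset en_vectors_web_lg --output /path/to/model --n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented --no-missing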

You might also find it helpful to convert the data over so that you can train spaCy directly. This lets you use whichever version of spaCy you want, and lets you use the extra features in spacy train. The easiest way to do this is:

prodigy db-out my-dataset > my-dataset.jsonl
spacy convert my-dataset.jsonl my-dataset.json

You'll want to create a train/test/dev split in your data as well, so that you're always evaluating against the same dataset.
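
A minimal sketch of what that split could look like, assuming one example per line in the JSONL file from db-out (the file names are just examples):

import random

random.seed(0)  # fixed seed so the split is reproducible
with open("my-dataset.jsonl", encoding="utf8") as f:
    lines = f.readlines()
random.shuffle(lines)  # shuffle before splitting so each split is random
n_test = n_dev = int(len(lines) * 0.1)
parts = {
    "test.jsonl": lines[:n_test],
    "dev.jsonl": lines[n_test:n_test + n_dev],
    "train.jsonl": lines[n_test + n_dev:],
}
for filename, part in parts.items():
    with open(filename, "w", encoding="utf8") as f:
        f.writelines(part)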

Thanks for the reply. I will try the vectors and run batch-train again. I did convert the data to spaCy's format. See my steps below and let me know if anything seems wrong.

python -m prodigy ner.batch-train dataset_03 en_core_web_sm --output dataset_03_Model --label man,ad,tile,call,loc,pay,id --n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented --no-missing

python -m spacy convert dataset_03.jsonl data_03.json --lang en

Split data_03.json into data_train.json (80%) and data_test.json (20%)

python -m spacy debug-data en data_train.json data_test.json -V

python -m spacy train en data_03_model data_train.json data_test.json

But each iteration is taking at least an hour. I thought something was wrong and stopped the training. I have never used the spacy train command before. The spaCy docs say it defaults to 30 iterations. Does this mean my whole spaCy training run takes around 30 hours? Is that common?

ner.batch-train:
en_vectors_web_lg improved accuracy a lot.
07 105010.762 42974 11188 47833 0 0.793
23:32:01 - MODEL: Merging entity spans of 4929 examples
23:32:02 - MODEL: Using 4929 examples (without 'ignore')
23:36:02 - MODEL: Evaluated 4929 examples
08 100563.650 43148 10878 47871 0 0.799
00:19:41 - MODEL: Merging entity spans of 4929 examples
00:19:41 - MODEL: Using 4929 examples (without 'ignore')
00:21:38 - MODEL: Evaluated 4929 examples
09 96124.139 43139 10749 47724 0 0.801
07:12:19 - MODEL: Merging entity spans of 4929 examples
07:12:19 - MODEL: Using 4929 examples (without 'ignore')
07:14:11 - MODEL: Evaluated 4929 examples
10 93199.083 43238 10600 47773 0 0.803

Correct 43238
Incorrect 10600
Baseline 0.000
Accuracy 0.803

spacy train command with default options:
python -m spacy train en data_03_model data_train.json data_test.json

Result:

Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
..............................
..............................
21 0.000 641.899 0.000 59.259 16.593 25.927 94.536 100.000 10626 0
22 0.000 521.809 0.000 58.926 16.461 25.733 94.536 100.000 10578 0
23 0.000 602.534 0.000 58.672 16.571 25.843 94.536 100.000 10632 0
24 0.000 508.629 0.000 57.198 16.483 25.591 94.536 100.000 10657 0
25 0.000 511.100 0.000 56.297 16.571 25.605 94.536 100.000 10570 0
26 0.000 393.782 0.000 55.242 16.395 25.285 94.536 100.000 10608 0
27 0.000 405.062 0.000 55.898 16.417 25.379 94.536 100.000 10577 0
28 0.000 459.113 0.000 55.422 16.240 25.119 94.536 100.000 10594 0
29 0.000 412.692 0.000 55.365 16.395 25.298 94.536 100.000 10533 0

If I increase the number of iterations to 50, the NER F-score in the last 10 iterations drops from 25 to 24. Am I missing anything when training with the spacy train command?

Any input on why spaCy is giving lower accuracy than Prodigy's batch-train?

I think this is trying to train a whole pipeline, perhaps? If you add the --pipeline ner argument, it should only train the NER, which should speed things up.

I'm not sure why the results are like that though. Maybe try the --vectors en_vectors_web_lg argument, and see if that helps?
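
Concretely, something like this, reusing your file names from above:

python -m spacy train en data_03_model data_train.json data_test.json --pipeline ner --vectors en_vectors_web_lg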

Sorry for bothering you again and again.

I ran the command below with a GPU, but I still cannot reach the batch-train accuracy. The F-score did go up from 25 to 60, whereas batch-train gives 80%.

python -m spacy train en data_03_model data_train.json data_test.json --pipeline ner --vectors en_vectors_web_lg --verbose

Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS


0 0.000 16856.079 0.000 41.657 15.755 22.863 0.000 100.000 27670 0
1 0.000 8337.593 0.000 56.357 36.187 44.074 0.000 100.000 30029 0
2 0.000 5879.421 0.000 61.365 46.227 52.731 0.000 100.000 30713 0
3 0.000 4454.633 0.000 67.587 47.529 55.810 0.000 100.000 30065 0
4 0.000 3767.012 0.000 69.480 50.684 58.612 0.000 100.000 30286 0
5 0.000 3595.046 0.000 66.305 52.538 58.624 0.000 100.000 30183 0
6 0.000 2713.307 0.000 66.560 55.252 60.381 0.000 100.000 29800 0
7 0.000 2182.027 0.000 67.480 54.943 60.569 0.000 100.000 30082 0
8 0.000 1895.314 0.000 69.627 54.325 61.031 0.000 100.000 29876 0
9 0.000 1522.148 0.000 68.580 54.568 60.777 0.000 100.000 29674 0
10 0.000 1444.246 0.000 69.091 53.663 60.407 0.000 100.000 29512 0
11 0.000 1545.287 0.000 69.146 52.714 59.822 0.000 100.000 29531 0
12 0.000 1312.477 0.000 69.391 52.824 59.985 0.000 100.000 28778 0
13 0.000 1033.243 0.000 69.866 52.846 60.176 0.000 100.000 29927 0
14 0.000 1109.869 0.000 69.479 53.244 60.287 0.000 100.000 30120 0
15 0.000 792.445 0.000 70.009 53.155 60.429 0.000 100.000 29524 0
16 0.000 847.468 0.000 69.977 52.868 60.231 0.000 100.000 28722 0
17 0.000 578.479 0.000 69.556 52.229 59.660 0.000 100.000 29627 0
18 0.000 714.992 0.000 69.107 51.037 58.713 0.000 100.000 29472 0
19 0.000 788.796 0.000 69.157 51.059 58.746 0.000 100.000 29327 0
20 0.000 503.560 0.000 69.412 51.324 59.013 0.000 100.000 29509 0
21 0.000 458.518 0.000 68.615 51.280 58.694 0.000 100.000 29238 0
22 0.000 513.204 0.000 69.224 50.574 58.447 0.000 100.000 29201 0
23 0.000 505.223 0.000 68.953 51.015 58.643 0.000 100.000 29527 0
24 0.000 430.888 0.000 69.481 51.390 59.082 0.000 100.000 29405 0
25 0.000 455.905 0.000 69.701 52.030 59.583 0.000 100.000 29088 0
26 0.000 455.122 0.000 69.649 52.052 59.578 0.000 100.000 29169 0
27 0.000 396.876 0.000 69.419 52.493 59.781 0.000 100.000 27990 0
28 0.000 391.423 0.000 69.611 52.868 60.095 0.000 100.000 29035 0
29 0.000 284.307 0.000 69.865 52.383 59.874 0.000 100.000 28942 0

Aah, I think I know what's wrong -- sorry, there's a step missing that Prodigy normally handles for you. I think we need to call prodigy.models.ner.merge_spans on your data. If you have multiple examples that refer to the same text with different annotations, this function consolidates them into one example.

Try this:

import json

import prodigy.components.db
import prodigy.models.ner

DB = prodigy.components.db.connect()
# Replace "dataset_03" with the name of your Prodigy dataset
examples = DB.get_dataset("dataset_03")
print(len(examples), "before merging")
examples = prodigy.models.ner.merge_spans(examples)
print(len(examples), "after merging")
for example in examples:
    print(json.dumps(example))
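
If you save that as a script and redirect its output to a file, you can feed the merged data into the same conversion step as before (merge_dataset.py is just a hypothetical name for the snippet above):

python merge_dataset.py > dataset_03_merged.jsonl
python -m spacy convert dataset_03_merged.jsonl data_03_merged.json --lang en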

The code above merges the spans for Prodigy, right? Do I then need to run batch-train again with the new dataset?
Or do I need to run it before the spacy train command? Sorry for asking, it is not clear to me.

Can you please make this clear to me? Sorry for bothering you, I am stuck here.

The above code merges the annotations you collected so that each example only exists once and contains all annotated spans (accepted and rejected). Prodigy's ner.batch-train will run that automatically (you can also see that when you look at the recipe).

But if you've collected binary annotations with Prodigy and you don't merge them before converting them to train with spaCy, you may end up with multiple conflicting annotations and worse results. So if you're not doing this already, you should merge the spans before you convert your data for spaCy.

Thank you so much for the clarification. I didn't use Prodigy for the annotations. I already have most of the required entities, so I used the EntityRuler to create Prodigy-style spans to feed to batch-train, and spaCy-format data for spacy train. Basically, I am using existing data to build the training data, then trying to train the model via the CLI.

Sorry, I'm having trouble understanding your workflow. So you created patterns, ran spaCy's EntityRuler over some texts, and now you're training a model to predict what the EntityRuler recognised? Won't the model just learn to repeat the rules?

I am really sorry if it's confusing. Let me explain it very clearly.
We want to use NER to replace a feed-based system in production, so we already have a pipeline that identifies the relevant information via some complicated rules. At first I tried annotating with Prodigy's ner.manual, but it was taking ages, so I started using the existing data to create spans for a given text.

So far my workflow:

-> Step 1: Fetch the actual text and entities from the database, and create a JSONL file in Prodigy-style format with just spans (no tokens). Repeat this step until there is enough data; I fetched around 24,000 examples.

For example (label, value, and text come from the database query, and span_list collects the Prodigy-style spans):

    my_regex_patterns.append({"label": label.upper(), "pattern": value.lower()})
    entity_ruler.add_patterns(my_regex_patterns)
    nlp.add_pipe(entity_ruler)
    doc = nlp(text.lower())
    for ent in doc.ents:
        span_list.append({"start": ent.start_char, "end": ent.end_char, "label": ent.label_})

-> Then run ner.batch-train. I used this because, before arriving at this workflow, I had used ner.manual and batch-train, so I was very comfortable with Prodigy's output.
-> Convert the Step 1 JSONL file into spaCy's format and create the train and test datasets using the commands below:
python -m spacy convert dataset_03.jsonl data_03.json --lang en
Split data_03.json into data_train.json (80%) and data_test.json (20%)
-> Then finally run the spacy train command.

Could you please let me know where I am going wrong? Also, please suggest better approaches for this work.

Hello,

I had an issue training my model, and I realised it was because I was not merging entities for the same text. From the way you are collecting your training data, it looks like you might hit a similar issue.

For example, you cannot train your model with these two training examples:
("Google and Apple are companies.", {'entities': [(0, 6, 'Company')]})
("Google and Apple are companies.", {'entities': [(11, 16, 'Company')]})

Instead you need to have:
("Google and Apple are companies.", {'entities': [(0, 6, 'Company'), (11, 16, 'Company')]})

Also, don't remove any training data with no entities. For example:
("Apple is a fruit.", {'entities': []})
is a useful training example.

I'm new to spaCy, so please correct me if I'm wrong.
I hope it helps.

Hi PEDRAM, thanks for your reply. As I am programmatically creating the entities from Oracle, my script already maps them in the way below, so I do not have separate examples for individual entities in the JSONL file.
("Google and Apple are companies.", {'entities': [(0, 6, 'Company'), (11, 16, 'Company')]})

At the same time, I cannot produce examples with an empty entity list, as every text in Oracle contains at least one field corresponding to it. Is it a problem not to have empty entity lists in the dataset? I feel like I am nearly there with training a model.

Hi @mystuff,

Okay, thanks! I think I understand better now. I wonder whether the problem could be in the data splitting? Did you make sure the split is a random 20%? It might be easier to shuffle the JSONL lines before you convert them, instead of splitting the JSON file afterwards.

More generally, I think training an NER model on the output of regex rules is unlikely to get you a better result than the rules themselves, unless you first manually correct any mistakes there might be in the rule output.

Have you tried using the ner.make-gold recipe? This would let you work through the data with the suggestions from the EntityRuler, and approve them. This should be quicker than ner.manual, while giving you results that are more correct than the rules.
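
For example, something like this (the dataset and input file names are placeholders, the model path would need to point at a model with your EntityRuler in its pipeline, and the label list is the one from your batch-train command):

python -m prodigy ner.make-gold gold_dataset /path/to/model_with_ruler raw_texts.jsonl --label man,ad,tile,call,loc,pay,id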