Hello,
I created 24908 documents with labels using the EntityRuler and PhraseMatcher. Then I ran ner.batch-train with these options: "--n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented --no-missing"
The accuracy is not improving much in the last iterations. Do I need to run more iterations or add more data? I think a dataset of 24,000 examples should be plenty for batch-train. Why might the accuracy not be improving?
15:05:54 - MODEL: Using 24908 examples (without 'ignore')
Using 20% of accept/reject examples (4929) for evaluation
15:05:59 - RECIPE: Temporarily disabled other pipes: ['tagger', 'parser']
15:05:59 - RECIPE: Initialised EntityRecognizer with model en_core_web_sm
15:05:59 - MODEL: Merging entity spans of 4929 examples
15:05:59 - MODEL: Using 4929 examples (without 'ignore')
15:08:50 - MODEL: Evaluated 4929 examples
15:08:50 - RECIPE: Calculated baseline from evaluation examples (accuracy 0.00)
Using 100% of remaining examples (19719) for training
Dropout: 0.2 Batch size: 16 Iterations: 10
BEFORE 0.000
Correct 0
Incorrect 48658
Entities 189068
Unknown 0
I think your dataset size is fine, so the settings are probably the thing to look at.
Try using en_vectors_web_lg instead of en_core_web_sm: with en_core_web_sm, I think the model is trying to train the classifier on top of the existing entities, while en_vectors_web_lg starts you off with word vectors and a blank model, which should work better.
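For example, roughly something like this (the dataset name and output path here are placeholders; the other settings are the ones you already used):

prodigy ner.batch-train your_dataset en_vectors_web_lg --output /path/to/ner-model --n-iter 10 --eval-split 0.2 --dropout 0.2 --unsegmented --no-missing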
You might also find it helpful to convert the data over so that you can train spaCy directly. This lets you use whichever version of spaCy you want, and lets you use the extra features in spacy train. The easiest way to do this is:
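Something along these lines should work, assuming your annotations live in a Prodigy dataset (the dataset and file names are placeholders, and it's worth checking spacy convert --help for the exact arguments your version expects):

prodigy db-out your_dataset > annotations.jsonl
python -m spacy convert annotations.jsonl training-data --lang en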
Thanks for the reply. I will try the vectors and run batch-train again. I did convert to the spaCy format. See my steps below and let me know if anything seems wrong.
python -m spacy convert dataset_03.jsonl data_03.json --lang en
Split data_03.json into data_train.json (80%) and data_test.json (20%)
python -m spacy debug-data en data_train.json data_test.json -V
python -m spacy train en data_03_model data_train.json data_test.json
But each iteration is taking at least 1 hour. I thought something was wrong and stopped the training. I have never used the spacy train command before. The spaCy docs say the default is 30 iterations; does this mean my whole spaCy training run would take around 30 hours? Is that common?
spaCy train command with default options:
python -m spacy train en data_03_model data_train.json data_test.json
Result:
Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS
..............................
..............................
21 0.000 641.899 0.000 59.259 16.593 25.927 94.536 100.000 10626 0
22 0.000 521.809 0.000 58.926 16.461 25.733 94.536 100.000 10578 0
23 0.000 602.534 0.000 58.672 16.571 25.843 94.536 100.000 10632 0
24 0.000 508.629 0.000 57.198 16.483 25.591 94.536 100.000 10657 0
25 0.000 511.100 0.000 56.297 16.571 25.605 94.536 100.000 10570 0
26 0.000 393.782 0.000 55.242 16.395 25.285 94.536 100.000 10608 0
27 0.000 405.062 0.000 55.898 16.417 25.379 94.536 100.000 10577 0
28 0.000 459.113 0.000 55.422 16.240 25.119 94.536 100.000 10594 0
29 0.000 412.692 0.000 55.365 16.395 25.298 94.536 100.000 10533
If I increase the number of iterations to 50, the accuracy of the last 10 iterations goes down from 25 to 24. Is there anything I'm missing when I train with the spacy train command?
I think this is trying to train a whole pipeline, perhaps? If you add the --pipeline ner argument, it should only train the NER, which should speed things up.
I'm not sure why the results are like that though. Maybe try the --vectors en_vectors_web_lg argument, and see if that helps?
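So, something like this might be worth a try (same files as your command above, just with the two extra arguments):

python -m spacy train en data_03_model data_train.json data_test.json --pipeline ner --vectors en_vectors_web_lg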
Aah, I think I know what's wrong. Sorry, there's a step missing that Prodigy normally handles for you: I think we need to call prodigy.models.ner.merge_spans on your data. If you have multiple examples that refer to the same text with different annotations, this function consolidates them into one example.
Try this:
import json
import prodigy.components.db
import prodigy.models.ner

dataset_name = "your_dataset_name"  # replace with the name of your Prodigy dataset

DB = prodigy.components.db.connect()
examples = DB.get_dataset(dataset_name)
print(len(examples), "before merging")

# Combine all annotations that refer to the same text into one example
examples = prodigy.models.ner.merge_spans(examples)
print(len(examples), "after merging")

# Print the merged examples as JSONL (one JSON object per line)
for example in examples:
    print(json.dumps(example))
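If that looks right, you can save it as a small script and redirect the output to a new JSONL file (the file names here are just suggestions), then convert and train from the merged file instead of the unmerged one:

python merge_dataset.py > dataset_03_merged.jsonl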
The above code merges the spans for Prodigy, right? Then do I need to run batch-train again with the new dataset?
Or do I need to run it for the spacy train command? Sorry for asking; it's just not clear to me.
The above code merges the annotations you collected so that each example only exists once and contains all annotated spans (accepted and rejected). Prodigy's ner.batch-train will run that automatically (you can also see that when you look at the recipe).
But if you've collected binary annotations with Prodigy and you don't merge them before converting them to train with spaCy, you may end up with multiple conflicting annotations and worse results. So if you're not doing this already, you should merge the spans before you convert your data for spaCy.
Thank you so much for the clarification. I didn't use Prodigy for the annotations. I already have most of the required entities, so I used the EntityRuler to create Prodigy-style spans to feed to batch-train, and the spaCy format for spacy train. Basically, I am using existing data to build the training data and then trying to train the model via the CLI.
Sorry I'm having trouble understanding your workflow. So you created patterns, ran spaCy's EntityRuler over some texts, and then you're training a model to predict what the EntityRuler recognised? Won't the model just learn to repeat the rules?
I am really sorry if it's confusing. Let me explain it clearly.
We want to use NER to replace a feeds-based system in production, so we already have a pipeline that identifies the relevant information through some fairly complicated rules. At first I tried annotating with Prodigy's ner.manual, but it was taking ages, so I started using the existing data to create spans for each text.
So far my workflow:
-> Step 1: Fetch the actual text and entities from the database and create a JSONL file in Prodigy-style format with just the spans (no tokens). Repeat this step until there is enough data; I fetched around 24000 examples. For example:
# label and value come from each database record; nlp, entity_ruler,
# text and span_list are set up earlier in the script
my_regex_patterns.append({"label": label.upper(), "pattern": value.lower()})
entity_ruler.add_patterns(my_regex_patterns)
nlp.add_pipe(entity_ruler)
doc = nlp(text.lower())
entities = list(doc.ents)
for ent in entities:
    span_list.append({"start": ent.start_char, "end": ent.end_char, "label": ent.label_})
-> Then run ner.batch-train. I used this because, before arriving at this workflow, I had used ner.manual and batch-train, so I was very comfortable with Prodigy's output.
-> Convert that Step 1 JSONL file into spaCy format and create the train and test datasets using the commands below:
python -m spacy convert dataset_03.jsonl data_03.json --lang en
Split data_03.json into data_train.json (80%) and data_test.json (20%)
-> Then, finally, run the spacy train command.
Could you please let me know where I am going wrong? Also, please suggest any better approaches for this work.
I had an issue training my model, and I realised it was because I was not merging entities for the same text. From the way you are collecting your training data, I can see you might run into a similar issue.
For example, you cannot train your model with these two training examples:
("Google and Apple are companies.", {'entities': [(0, 6, 'Company')]})
("Google and Apple are companies.", {'entities': [(11, 16, 'Company')]})
Instead you need to have:
("Google and Apple are companies.", {'entities': [(0, 6, 'Company'), (11, 16, 'Company')]})
Also don't remove any training data with no entities, for example:
("Apple is a fruit.", {'entities': []})
is still a useful training example.
I'm new to spaCy, so please correct me if I'm wrong.
I hope it helps.
Hi PEDRAM, thanks for your reply. As I am programmatically creating the entities from Oracle, my script already maps them into the form below; I do not have individual, unmerged entities in the JSONL file.
("Google and Apple are companies.", {'entities': [(0, 6, 'Company'), (11, 16, 'Company')]})
At the same time, I cannot produce an empty list, as every text in Oracle contains at least one field corresponding to the text. Is it a problem not to have any empty-entity examples in the dataset? I feel like I am nearly there with training a model.
Okay, thanks! I think I understand better now. I wonder whether the problem could be in the data splitting? Did you make sure the split is on a random 20%? It might be easier to shuffle the jsonl lines before you convert them, instead of splitting the json file afterwards.
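For example, a small script along these lines could do the shuffle and 80/20 split on the JSONL before converting (the output file names are just suggestions):

import random

# Shuffle the annotation lines and split them 80/20 into train and test sets
random.seed(0)
with open("dataset_03.jsonl", encoding="utf8") as f:
    lines = f.readlines()
random.shuffle(lines)
split = int(len(lines) * 0.8)
with open("train_03.jsonl", "w", encoding="utf8") as f:
    f.writelines(lines[:split])
with open("test_03.jsonl", "w", encoding="utf8") as f:
    f.writelines(lines[split:])

You can then run spacy convert on each of the two files to get the train and dev JSON files.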
More generally, I think training an NER model on the output of regex rules is unlikely to get you a better result than the rules themselves, unless you first manually correct any mistakes there might be in the rule output.
Have you tried using the ner.make-gold recipe? This would let you work through the data with the suggestions from the EntityRuler, and approve them. This should be quicker than ner.manual, while giving you results that are more correct than the rules.
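If you go that route, one possible setup (the dataset name, paths, and labels here are placeholders) is to save the pipeline that contains your EntityRuler to disk and point the recipe at it, so its matches show up as pre-highlighted suggestions:

# in the script where you build the ruler: save the pipeline so Prodigy can load it
nlp.to_disk("./model_with_ruler")

prodigy ner.make-gold gold_dataset ./model_with_ruler source_texts.jsonl --label COMPANY,YOUR_OTHER_LABELS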