Problem in training the model

dhavalv83 · May 2, 2020, 7:47am

Hello,

prodigy train ner corpus blank:en --output ./corpus-test-model --eval-split 0.2 --n-iter 15

This is my command to train the model. I have 210,000 records, of which 807 are annotated. When I start the NER training, I get this output.

ines · May 3, 2020, 11:08am

Hi! What's in your dataset corpus and how do your annotations look? And when you say "807 is annotated", does that mean you're training on the whole corpus of annotated and unannotated examples? If so, that explains what's going on here: you're training your model with tons of data with no entity annotations, so it's not learning anything. Instead, you want to be training it on the annotations only.

Prodigy doesn't require you to import any unannotated examples – you can just load in your data and annotate, and then save your annotations to a dataset. You can then use the train command to run a training experiment using your dataset of annotated examples.

dhavalv83 · May 4, 2020, 6:04am

Hi,

I have 210,000 text files as my corpus. Out of these, I have manually annotated 800+ files in Prodigy for 2 entities: organization (ORG) and product (PRODUCT).

Now I want to annotate the remaining files and build a model.

ines · May 4, 2020, 6:34pm

In your screenshot, it does look like you're training from 120k+ examples, though – so maybe you accidentally imported your raw unannotated corpus and are training from that? This would explain why you're not getting any results. If you've only annotated 800 examples, those are also the examples you should be training on.
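One quick way to check this is to count how many records in the data actually carry entity spans. The following is a minimal sketch, assuming the dataset has been exported to a JSONL file with one task per line (for example with `prodigy db-out`) and that annotated tasks have a non-empty "spans" list. The file name `corpus-export.jsonl` and the function `count_annotated` are hypothetical.

```python
import json
from pathlib import Path


def count_annotated(lines):
    """Count tasks with and without entity spans in an iterable of JSONL lines."""
    with_spans = 0
    without_spans = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        # A missing "spans" key and an empty list are both counted as "no spans".
        if task.get("spans"):
            with_spans += 1
        else:
            without_spans += 1
    return with_spans, without_spans


if __name__ == "__main__":
    # Hypothetical file name; point this at your own export.
    path = Path("corpus-export.jsonl")
    if path.exists():
        with path.open(encoding="utf8") as f:
            annotated, empty = count_annotated(f)
        print(f"{annotated} tasks with spans, {empty} without")
```

If the second number is far larger than the first, the training run is mostly seeing texts with no entities marked.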

dhavalv83 · May 5, 2020, 3:18pm

Hi Ines,
Indeed, we initially trained only on those 800 files, but we noticed no improvement from 500 files upwards. We first annotated 500 files and ran the training, then re-ran the training with an additional increment of about 100 files. Precision and recall for both our labels stalled at the values from the first training run.
We then thought the training might achieve better results with the entire corpus. So we used db-in to bring the 807 files into the entire corpus of ~210,000 files and did a re-hash.

As you can see in the attached screenshot, our corpus does contain 211,468 files, of which about 37 have been ignored: they were identified as duplicates in the annotation process or ignored there. It is unclear to me why only ~120,000 files were considered during training out of the total in the corpus, as per my stats. Unfortunately, this confuses me even more...

At the end of the day, what we want to achieve is to use a larger corpus for training so that we can improve our model.

Here are also the commands that we have used:

Command to import the annotated file with 807 files into the corpus:
prodigy db-in corpus < corpus-ner.jsonl
Command for training the model:
prodigy train ner corpus blank:en --output ./corpus-test-model --eval-split 0.2 --n-iter 15

dhavalv83 · May 5, 2020, 3:36pm

This new screenshot shows that in a new run, Prodigy has again taken into consideration only around 120,000 files out of the total of 211,000 in the corpus. Is there a way to force the entire corpus to be considered? The same training command was used.

ines · May 5, 2020, 10:08pm

Okay, so just to make sure I understand this correctly: your corpus_ner.jsonl file that you're importing to the dataset here has 200k lines and those are all annotated examples with a "text" and "spans" describing the entities in the text? And those are the examples you now want to train a model on?
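For reference, an annotated NER example in this sense could look like the record in the sketch below. It assumes the usual JSONL task layout, with character offsets in "start" and "end" plus a "label" for each span. The sample sentence and the helper `is_annotated_ner_task` are made up for illustration and are not taken from the data in this thread.

```python
import json

# Made-up example line: one task with a "text" and two entity "spans".
line = (
    '{"text": "Acme released the Rocket 3000.", '
    '"spans": [{"start": 0, "end": 4, "label": "ORG"}, '
    '{"start": 18, "end": 29, "label": "PRODUCT"}]}'
)


def is_annotated_ner_task(task, labels=("ORG", "PRODUCT")):
    """Return True if the task has text and at least one well-formed span."""
    text = task.get("text")
    spans = task.get("spans")
    if not isinstance(text, str) or not spans:
        return False
    for span in spans:
        start, end = span.get("start"), span.get("end")
        if not isinstance(start, int) or not isinstance(end, int):
            return False
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            return False
        if span.get("label") not in labels:
            return False
    return True


task = json.loads(line)
entities = [(task["text"][s["start"]:s["end"]], s["label"]) for s in task["spans"]]
print(is_annotated_ner_task(task))
print(entities)
```

A line that only has a "text" key, like a raw corpus file, would fail this check.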

dhavalv83 · May 6, 2020, 6:56am

Hi!
corpus_ner.jsonl includes those 807 text files with the correct annotations for ORG and PRODUCT.

ines · May 7, 2020, 10:55am

Maybe you accidentally imported and added to the same dataset multiple times? Maybe try again with a fresh dataset and only your annotated examples? And then, as you need to annotate more examples, save the result to a new dataset so it's easier to analyse and inspect them separately. It definitely looks like there's something off with the data you're training from, if you're seeing 0 results everywhere.
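A sketch of that clean-up step, assuming the annotations are available as JSONL lines with "text" and "spans": it keeps each annotated example once and drops repeats from multiple imports. Using the text plus its spans as the duplicate key is an assumption for illustration and does not rely on Prodigy's own hashing. The names `unique_annotated` and `write_jsonl` are hypothetical.

```python
import json


def unique_annotated(lines):
    """Yield each annotated task once, skipping repeats and tasks without spans."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        spans = task.get("spans") or []
        if not spans:
            continue  # no entity annotations on this task
        # Assumed duplicate key: the text together with its sorted spans.
        key = (
            task.get("text"),
            tuple(sorted((s["start"], s["end"], s["label"]) for s in spans)),
        )
        if key in seen:
            continue
        seen.add(key)
        yield task


def write_jsonl(tasks, path):
    """Write tasks to a JSONL file, one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")
```

The resulting file can then be imported into a fresh dataset. Note that this filter also drops texts that were reviewed and genuinely contain no entities, so it is only a starting point for inspecting the data.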

dhavalv83 · May 26, 2020, 1:18pm

Hello

We tried again with the steps below.

We are training an NER model for our corpus with Prodigy on two labels. Here is the setup:

We have ~210,000 text files in our corpus.
We annotated ~807 files manually for ORG and PRODUCT.
Now we have to train the model with annotated and non-annotated files (something like automatic annotation for the remaining files).
So what we did is:
We created a new dataset, inserted annotated and unannotated files (800 + 14,200) with db-in, and then started training.
Here is the result:

[Screenshot from 2020-05-25 18-40-19]

What we want to achieve here is to train a model on this combined dataset of annotated and non-annotated files.

Also, as you can see in the screenshot, for the first 8 sets we have 0 precision and F-score. This improves later, but PRODUCT stays at 0 overall, so we would also like to understand this behaviour.

ines · May 26, 2020, 1:35pm

Is there a reason you're training on annotated and unannotated files? Maybe I misunderstand the workflow, but it doesn't really make sense to me? If you're training from unannotated examples, you're essentially telling the model "this text contains no entities" and that's wrong?
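To put rough numbers on this for the 800 + 14,200 dataset described above, here is a back-of-the-envelope sketch. It assumes every unannotated text is read as an example with no entities and that `--eval-split 0.2` holds out 20% of the examples. How exactly the split is drawn is not shown in this thread, so the sizes are approximate.

```python
# Figures from the post above; the split arithmetic is an assumption.
annotated = 800
unannotated = 14_200
eval_split = 0.2

total = annotated + unannotated
share_annotated = annotated / total
eval_size = round(total * eval_split)
train_size = total - eval_size

print(f"{total} examples, {share_annotated:.1%} with manual annotations")
print(f"about {train_size} for training, about {eval_size} for evaluation")
```

Under these assumptions, only around one example in twenty carries any ORG or PRODUCT annotation, and the held-out evaluation examples are mostly unannotated texts as well.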
