Problem in training the model

Hello,

prodigy train ner corpus blank:en --output ./corpus-test-model --eval-split 0.2 --n-iter 15

This is my command to train the model. I have 210,000 records, of which 807 are annotated. When I start the NER training, I get this output.

Hi! What's in your dataset corpus and what do your annotations look like? And when you say "807 are annotated", does that mean you're training on the whole corpus of annotated and unannotated examples? If so, that explains what's going on here: you're training your model on tons of data with no entity annotations, so it's not learning anything. Instead, you want to train it on the annotations only.

Prodigy doesn't require you to import any unannotated examples – you can just load in your data and annotate, and then save your annotations to a dataset. You can then use the train command to run a training experiment using your dataset of annotated examples.
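For example, a minimal version of that workflow with a blank English pipeline might look something like this (the dataset and file names here are just placeholders):

    prodigy ner.manual ner_annotations blank:en ./raw_texts.jsonl --label ORG,PRODUCT
    prodigy train ner ner_annotations blank:en --output ./tmp-model --eval-split 0.2 --n-iter 15

The first command saves your manual annotations to the ner_annotations dataset, and the second trains only on those annotated examples.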

Hi,

I have 210,000 text files as my corpus. Out of these, I've annotated 800+ files manually in Prodigy for two entities: organization (ORG) and product (PRODUCT).

Now I want to annotate the remaining files and build a model.

In your screenshot, it does look like you're training from 120k+ examples, though – so maybe you accidentally imported your raw unannotated corpus and are training from that? This would explain why you're not getting any results. If you've only annotated 800 examples, those are also the examples you should be training on.

Hi Ines,
indeed, we initially trained only on those 800 files, but we noticed no increase in learning beyond about 500 files. We first annotated 500 files and ran the training, then re-ran the training with increments of about 100 additional files. From the first training onwards, precision and recall stayed stuck at their initial values for both our labels.
Then we thought the training might achieve better results if we go with the entire corpus. So we imported the 807 annotated files with db-in into the same dataset as the entire corpus of ~210,000 files and retrained.

As you can see in the attached screenshot, our corpus contains 211,468 files, of which about 37 have been ignored - they were identified as duplicates during the annotation process or ignored there. It is unclear to me why, according to my stats, only ~120,000 files out of the total corpus were considered during training. This is unfortunately confusing me even more...

At the end of the day, what we want to achieve is to use a larger corpus for the training so that we can enhance our model.

Here are also the commands that we have used:

  • Command to import the annotated file with 807 files into the corpus:
    prodigy db-in corpus < corpus-ner.jsonl

  • Command for training the model:
    prodigy train ner corpus blank:en --output ./corpus-test-model --eval-split 0.2 --n-iter 15

This new screenshot shows that on a new run, Prodigy again only took around 120,000 files into consideration out of the 211,000 in the corpus. Is there a way to enforce that the entire corpus is considered? The same training command was used.

Okay, so just to make sure I understand this correctly: your corpus_ner.jsonl file that you're importing to the dataset here has 200k lines and those are all annotated examples with a "text" and "spans" describing the entities in the text? And those are the examples you now want to train a model on?
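For reference, an annotated line in that file would look roughly like this (the text and character offsets are invented purely for illustration):

    {"text": "Apple released the iPhone last year.", "spans": [{"start": 0, "end": 5, "label": "ORG"}, {"start": 19, "end": 25, "label": "PRODUCT"}]}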

Hi!
corpus_ner.jsonl includes those 807 text files with the correct annotations for ORG and PRODUCT.

Maybe you accidentally imported and added to the same dataset multiple times? Maybe try again with a fresh dataset and only your annotated examples? And then, as you need to annotate more examples, save the result to a new dataset so it's easier to analyse and inspect them separately. It definitely looks like there's something off with the data you're training from, if you're seeing 0 results everywhere.
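If you want to double-check what's actually in the dataset before training, a quick sketch along these lines (using Prodigy's database API; "corpus" is assumed to be the dataset name) can help:

    from prodigy.components.db import connect

    db = connect()                       # connect to the Prodigy database
    examples = db.get_dataset("corpus")  # all examples saved to that dataset

    with_spans = [eg for eg in examples if eg.get("spans")]
    print("total examples:", len(examples))
    print("examples with entity spans:", len(with_spans))

If the total is much larger than ~807, or the spans count is close to zero, that would confirm the dataset contains the raw, unannotated corpus.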

Hello

Tried again with the steps below.

We are training an NER model for our corpus with Prodigy on two labels. Here is the setup we used:

  • We have ~210,000 text files in our corpus.
  • We annotated ~807 files manually for ORG and PRODUCT.
  • Now we have to train the model with annotated and non-annotated files (something like automatic annotation for the remaining files).
    So what we did is:
  • Created a new dataset, inserted the annotated and unannotated files (800 + 14,200) with db-in and then started training
  • and here is the result

What we want to achieve here is to train a model on this combined dataset of annotated and non-annotated examples.

Also, as you can see in the screenshot, for the first 8 iterations we have 0 precision and F-score, which improves later but stays at 0 overall for PRODUCT, so we want to understand this behaviour as well.

Is there a reason you're training on annotated and unannotated files? Maybe I misunderstand the workflow, but it doesn't really make sense to me. If you're training from unannotated examples, you're essentially telling the model "this text contains no entities", and that's wrong.
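If the goal is to train only on the manual annotations, one rough sketch (the output file name is hypothetical) is to filter the JSONL before importing it into a fresh dataset with db-in:

    import json

    # keep only examples that actually contain entity annotations and were accepted
    with open("corpus-ner.jsonl", encoding="utf8") as f_in, \
         open("corpus-ner-annotated.jsonl", "w", encoding="utf8") as f_out:
        for line in f_in:
            eg = json.loads(line)
            if eg.get("spans") and eg.get("answer", "accept") == "accept":
                f_out.write(json.dumps(eg) + "\n")

You'd then import only the filtered file and train from that dataset.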