Using ner.teach annotations to improve a model produced by ner.batch-train


We would like to wrap up around 800 examples, previously annotated by labelers, into a model, and hope to improve it by manually reviewing each label.

The path we are currently taking:

db-in the annotations --> ner.batch-train to build a model (adjusting the settings to get the highest accuracy) --> ner.teach to manually accept or reject each instance

How do we export the manually adjusted annotations to the previously trained model? Or is it possible to export these annotations to train a new model?

Hi! I don't think I understand the question, sorry! Could you give an example?

Hi Ines,

Here is what I have done for my project (to generate a model that auto-labels data, based on 800 pre-annotated examples):

  1. I created a dataset that contains the 800 pre-annotated examples
  2. I ran ner.batch-train to generate model A, which also reports the accuracy of this model
  3. In order to improve the model, I used ner.teach to manually go through each of the pre-annotated examples, and was able to fix a lot of them by clicking yes or no

My question: how do I build a model B that includes the manually annotated data from step 3? Or does the ner.teach data automatically update model A when I save and close ner.teach?

Thanks for the clarification!

ner.teach doesn't save out the updated model, no. You typically want to batch train the model "properly" afterwards to get even better results.

When you annotate with ner.teach, make sure you save those annotations to a separate dataset. When you're done annotating, you can then take model A, run ner.batch-train with the accept/reject annotations and output model B.

Hi Ines,

Thanks for your response. What is the command for saving ner.teach results to a separate dataset? I ran ner.teach as follows:

  1. dataset DatasetA "DatasetA"
  2. db-in DatasetA pre_annotated_data_1.json
  3. ner.batch-train DatasetA en_core_web_sm --output modelA
  4. ner.teach DatasetA modelA pre_annotated_data_1.json

Are you suggesting that I should db-out the saved annotations from step 4 to a separate dataset (DatasetB) and run ner.batch-train on DatasetB?

Sorry if this was unclear! I meant that when you run ner.teach, the first argument (in your example, "DatasetA") should be a different name. For example, "DatasetB". This will save the annotations you create to the other dataset.

When you then batch train again in a new step 5, you can update the model that was output in step 3 (modelA) with the annotations from DatasetB.
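Put together, the whole sequence could look roughly like the sketch below. It only builds the command lines as Python lists and prints them, nothing is executed. The `prodigy` executable prefix and the helper name `build_workflow` are assumptions for illustration; the dataset, model and file names are the ones used in this thread.

```python
import shlex


def build_workflow(source="pre_annotated_data_1.json"):
    """Return the five commands as argument lists (hypothetical helper)."""
    prodigy = ["prodigy"]  # assumption: recipes are run via the `prodigy` executable
    return [
        # Steps 1-3: create DatasetA, import the annotations, train modelA
        prodigy + ["dataset", "DatasetA", "DatasetA"],
        prodigy + ["db-in", "DatasetA", source],
        prodigy + ["ner.batch-train", "DatasetA", "en_core_web_sm", "--output", "modelA"],
        # Step 4: save the accept/reject answers to a *different* dataset
        prodigy + ["ner.teach", "DatasetB", "modelA", source],
        # Step 5: start from modelA, train on DatasetB, write modelB
        prodigy + ["ner.batch-train", "DatasetB", "modelA", "--output", "modelB"],
    ]


for step, cmd in enumerate(build_workflow(), start=1):
    print(step, shlex.join(cmd))
```

The only changes compared to the steps listed earlier are the dataset name in step 4 and the extra training run in step 5, which takes modelA as its base model.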

I think I understand now,

I should be doing this: ner.teach DatasetB modelA pre_annotated_data_1.json in order to save the newly annotated data into a separate dataset (DatasetB)

Then I can run ner.batch-train, for example:
ner.batch-train DatasetB en_core_web_sm --output modelB

Let me know if I am understanding this right.

I have encountered another issue regarding the above:

When I followed what you mentioned:
1) ner.teach results saved to a separate dataset (DatasetB), around 1600 annotations
2) ner.batch-train DatasetB en_core_web_sm --output modelB

I only see 300 examples loaded in ner.batch-train. Which command did I use incorrectly?

Are these the actual examples that are loaded, or the examples used for training? It's possible that you end up with fewer unique training examples, because Prodigy will merge all annotations on the same text into one example. The recipe will also hold back examples for evaluation if you do not provide an evaluation set, so it can output accuracy results.
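To illustrate why the counts can differ, here is a rough sketch. It is not Prodigy's actual code: it simply groups annotations by their text and then holds back a share of the merged examples for evaluation. The function names, the toy data and the 20% split are all assumptions for illustration.

```python
import random
from collections import defaultdict


def merge_by_text(annotations):
    """Group accept/reject answers that share the same text into one example."""
    merged = defaultdict(list)
    for eg in annotations:
        merged[eg["text"]].append((eg["span"], eg["answer"]))
    return [{"text": text, "answers": answers} for text, answers in merged.items()]


def split_examples(examples, eval_split=0.2, seed=0):
    """Shuffle and hold back a fraction of the merged examples for evaluation."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_split)
    return examples[n_eval:], examples[:n_eval]


# Toy data: 5 texts, each with 4 answered suggestions -> 20 annotations.
annotations = [
    {"text": f"sentence {i}", "span": j, "answer": "accept" if j % 2 else "reject"}
    for i in range(5)
    for j in range(4)
]
examples = merge_by_text(annotations)
train, evaluation = split_examples(examples)
print(len(annotations), len(examples), len(train), len(evaluation))  # 20 5 4 1
```

In this toy case, 20 annotations collapse into 5 examples, of which 4 are left for training. The same kind of effect would explain a large gap between the number of annotations and the number of examples reported during training.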

Well, the total number of examples is 400, and I used a 20/80 split, even though I made 1000+ annotations in total.

When I used db-in to import the original examples (the 800 pre-annotated ones) and ran ner.batch-train the first time, it used around 700 examples for training. Does that mean it loaded all of the examples for training the model?