Correct order in Named Entity Recognition

Hi. I am starting out with Prodigy and I have doubts about whether the order we are following is correct. What would a recommended order be?
In the first instance, we created 5 entity labels (Place, Person, Organization, Area, Position) and ran ner.manual to be able to teach the model.
Then we ran ner.make-gold on approximately 300 cases and saw that the model improved in its recognition. The problem is that we then ran ner.teach on approximately 1000 cases and, as a consequence, we observed that the model got worse.
Could you recommend a correct order? For example, first ner.manual, second ner.teach, etc.

In general, your approach sounds reasonable: you first created some gold-standard training data manually to bootstrap the new entity types and then went on to improve the model with an active learning-powered recipe.

Could you share some more details on the exact commands you ran? And what did you use as a base model?

Thanks for your answer. The entities are in Spanish. Here are the steps we followed:

prodigy dataset bora_dataset_2018

#GOLD1
PRODIGY_PORT=8000 prodigy ner.make-gold bora_dataset_2018 es_core_news_sm --output /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label AREA
PRODIGY_PORT=8001 prodigy ner.make-gold bora_dataset_2018 es_core_news_sm --output /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label ORGANISMO
PRODIGY_PORT=8002 prodigy ner.make-gold bora_dataset_2018 es_core_news_sm --output /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label PERSONA
PRODIGY_PORT=8003 prodigy ner.make-gold bora_dataset_2018 es_core_news_sm --output /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label CARGO
PRODIGY_PORT=8004 prodigy ner.make-gold bora_dataset_2018 es_core_news_sm --output /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label LUGAR
PRODIGY_PORT=8005 prodigy ner.make-gold bora_dataset_2018 es_core_news_sm --output /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label LEY, DECRETO, DA

#TRAIN
prodigy ner.batch-train bora_dataset_2018 /opt/prodigy/data/salidamodelo_bora2018 --output /opt/prodigy/data/salidamodelo_bora2018 --label "AREA, ORGANISMO, PERSONA, CARGO, LUGAR, LEY, DECRETO, DA" --eval-split 0.2 --n-iter 15

#GOLD2 WITH MODEL
PRODIGY_PORT=8000 prodigy ner.make-gold bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label AREA
PRODIGY_PORT=8001 prodigy ner.make-gold bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label ORGANISMO
PRODIGY_PORT=8002 prodigy ner.make-gold bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label PERSONA
PRODIGY_PORT=8003 prodigy ner.make-gold bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label CARGO
PRODIGY_PORT=8004 prodigy ner.make-gold bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label LUGAR
PRODIGY_PORT=8005 prodigy ner.make-gold bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label LEY, DECRETO, DA

#TRAIN
prodigy ner.batch-train bora_dataset_2018 /opt/prodigy/data/salidamodelo_bora2018 --output /opt/prodigy/data/salidamodelo_bora2018 --label "AREA, ORGANISMO, PERSONA, CARGO, LUGAR, LEY, DECRETO, DA" --eval-split 0.2 --n-iter 15

#TEACH
PRODIGY_PORT=8000 prodigy ner.teach bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label AREA
PRODIGY_PORT=8001 prodigy ner.teach bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label ORGANISMO
PRODIGY_PORT=8002 prodigy ner.teach bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label PERSONA
PRODIGY_PORT=8003 prodigy ner.teach bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label CARGO
PRODIGY_PORT=8004 prodigy ner.teach bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label LUGAR
PRODIGY_PORT=8005 prodigy ner.teach bora_dataset_2018 /opt/prodigy/data/modelo /opt/prodigy/data/dataset11.txt --label LEY, DECRETO, DA

When you say the model got worse, are you basing that on the evaluation printed by ner.batch-train?

In your commands there, it looks like you’re using the --eval-split option to conduct the evaluation. This means that each time you continue training the model, you’ll be evaluating over different data, especially as you continue annotating and the dataset grows.
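One way around that is to keep a dedicated evaluation dataset and pass it in explicitly, so every training run is scored on the same examples. As a rough sketch, assuming a hypothetical evaluation dataset called bora_eval_2018 (more on creating one below) and that your Prodigy version’s ner.batch-train supports the --eval-id argument:

#TRAIN WITH A FIXED EVALUATION SET (sketch)
prodigy ner.batch-train bora_dataset_2018 es_core_news_sm --output /opt/prodigy/data/salidamodelo_bora2018 --label "AREA, ORGANISMO, PERSONA, CARGO, LUGAR, LEY, DECRETO, DA" --eval-id bora_eval_2018 --n-iter 15

With --eval-id, the accuracy printed after each run refers to the same held-out examples, so the numbers stay comparable between runs.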

The ner.teach recipe uses active learning, and by default the strategy is to select cases the model is most unsure about. This means you’re biasing the sample towards hard cases. This is good for training, but may be misleading for evaluation.

I would recommend separating out some of your examples and making a dedicated evaluation set. Take care that the texts in your evaluation set do not also occur in your training data, so that accuracy on the evaluation set is a better indication of accuracy on unseen data.
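As a rough way to set that up and sanity-check the overlap, assuming the same hypothetical bora_eval_2018 dataset and that you have jq available for the comparison step:

#CREATE A SEPARATE EVALUATION DATASET (sketch)
prodigy dataset bora_eval_2018 "Held-out evaluation data"

#CHECK THAT NO TEXT OCCURS IN BOTH SETS
prodigy db-out bora_dataset_2018 > train.jsonl
prodigy db-out bora_eval_2018 > eval.jsonl
# any line printed here is a text that appears in both training and evaluation data
comm -12 <(jq -r .text train.jsonl | sort -u) <(jq -r .text eval.jsonl | sort -u)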

When annotating the evaluation data, you want to use either the ner.make-gold or ner.manual recipe, rather than ner.teach. The goal is to get complete and correct annotations for a random sample of text. ner.teach skips through the text asking the questions the model can learn the most from, which isn’t the right strategy for annotating evaluation data.
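The annotation itself could look roughly like the ner.make-gold commands you already ran, just writing into the evaluation dataset and reading from a source file that you keep out of your training data (the file name eval_texts.txt below is a placeholder):

#ANNOTATE EVALUATION DATA (sketch)
prodigy ner.make-gold bora_eval_2018 es_core_news_sm /opt/prodigy/data/eval_texts.txt --label "AREA, ORGANISMO, PERSONA, CARGO, LUGAR, LEY, DECRETO, DA"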

As a rule of thumb, you’ll want at least 10 entities in your evaluation data behind the smallest accuracy difference you want to measure. So if you want to distinguish, say, 70% accuracy from 71% accuracy, that 1% difference should correspond to at least 10 entities, which works out to an evaluation set of at least 1000 entities.