Adding a new named entity incrementally...

Hello,

In the great demo video on how to train a new named entity type on Reddit comments [https://prodi.gy/docs/video-new-entity-type], I am curious whether, at the ner.batch-train step, the training process automatically samples and mixes the annotations of the existing entity types with the new entity annotations (in both the train and eval sets), so that the model doesn't forget the old entity types. In addition, I am assuming that the new model also preserves the old entity types, instead of only recognizing the new one, correct?

In my case, I will need to add multiple named entities. If binary annotation is the best practice for collecting examples, can I train each new entity iteratively on top of the previously saved model? Furthermore, how can I reuse the already annotated text, given that there are multiple entities in each text? Should I feed in the original text again, or load the saved annotation file? Would the annotations be presented as a list of tuples for multiple entities?

In regard to very specific entity types in a language such as Chinese, is it better to train from a blank model, as I understand? Does it suffice for named entity recognition if the blank model can only segment and tokenize words, meaning that it's not required to have a model trained with part-of-speech tags, a dependency parser or word vectors?

As for specifying patterns, I am not sure if the matcher works for Chinese, or whether I should just use a pattern like [{}, {'orth': WORD}] to identify possible entity candidates.
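To make it concrete, here's a minimal sketch of the kind of matching I mean, assuming spaCy v2.x and that the Chinese tokenizer (which uses jieba) segments the drug name as a single token – the label and example text are made up:

import spacy
from spacy.matcher import Matcher

# blank Chinese model: tokenization only (spaCy v2.x needs jieba
# installed for Chinese word segmentation)
nlp = spacy.blank("zh")
matcher = Matcher(nlp.vocab)

# 'ORTH' matches the exact token text; "阿司匹林" = "aspirin"
matcher.add("DRUG", None, [{"ORTH": "阿司匹林"}])

doc = nlp("医生建议每天服用阿司匹林。")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)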

Finally, I notice that the annotation content and format from Prodigy are quite different from the annotation output I'm used to in spaCy, which follows a pattern like {'entities': [(start_char, end_char, label)]}. If I'd like to reuse the annotation samples for updating a spaCy model, or even export them in IOB/BIOES format for use in other scenarios, would that be possible?

Thanks for any clarifications.

Hi! In case you haven't seen it, you might find the NER flowchart useful – it describes different scenarios and some general-purpose tips and tricks:

That's something you have to include in your training data if this is how you want the model to behave. The ner.make-gold workflow could help you with that: it'll highlight all entities that the model already predicts and let you correct the predictions and add more labels if needed.

If your new labels potentially overlap with existing types, you probably want to start training from scratch. Otherwise, you're constantly "fighting" the existing predictions and you'll need much more data to teach the model to suddenly predict a label very differently. You can still use a workflow like ner.make-gold to create the data semi-automatically, though.

While you can use the previously trained model in the loop, it's not really the best strategy if you want to train your final model. It's usually better and cleaner to use the same base model and train it on all annotations.

We don't distribute pretrained spaCy NER models for Chinese, so unless you already have your own pretrained model, you probably need to start from scratch anyway. The model components are completely independent – so in order to train an entity recognizer, you won't need a tagger or parser.
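For example, a minimal sketch of a blank Chinese pipeline with only an entity recognizer could look like this (assuming spaCy v2.x with jieba installed; the label is just a placeholder):

import spacy

# blank Chinese model: only the tokenizer, no tagger or parser
nlp = spacy.blank("zh")

# the entity recognizer becomes the only pipeline component
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("DRUG")  # placeholder label

# initialize the weights so the component can be trained
optimizer = nlp.begin_training()
print(nlp.pipe_names)  # ['ner']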

However, having word vectors can improve accuracy, because if word vectors are available, they will be used as features in the model, letting you provide more information. How well this works in Chinese depends on the tokenization, and you typically want to make sure that the vectors you use include embeddings for the same words the tokenizer produces (and not just individual characters).
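A quick sanity check is to run the tokenizer over some sample text and see which tokens actually have a vector – a rough sketch, assuming you've already packaged your vectors into a loadable model (the model name here is hypothetical):

import spacy

# hypothetical model created with `spacy init-model` and custom vectors
nlp = spacy.load("my_zh_model")

doc = nlp("医生建议每天服用阿司匹林。")
print("have vectors:", [t.text for t in doc if t.has_vector])
print("no vectors:", [t.text for t in doc if not t.has_vector])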

It's not that different? Prodigy just uses named keys for the start, end and label – so instead of (start, end, label), an entity is described as {"start": start, "end": end, "label": label}.
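To illustrate, converting one Prodigy-style example to spaCy's tuple format is a one-liner (the data here is made up):

# a Prodigy-style annotated example (simplified, made-up data)
task = {
    "text": "Acme Corp is based in Berlin.",
    "spans": [
        {"start": 0, "end": 9, "label": "ORG"},
        {"start": 22, "end": 28, "label": "LOC"},
    ],
}

# spaCy's offset format: a list of (start, end, label) tuples
entities = [(s["start"], s["end"], s["label"]) for s in task["spans"]]
train_example = (task["text"], {"entities": entities})
print(train_example)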

Sure, you can always export the data and use it in spaCy. The values are the same – and if you ever need BILUO tags etc., you can use spaCy's conversion utilities.
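For example, something like this (assuming spaCy v2.x, where the helper lives in spacy.gold):

import spacy
from spacy.gold import biluo_tags_from_offsets

# tokenization is all we need to align the character offsets
nlp = spacy.blank("en")
doc = nlp("Acme Corp is based in Berlin.")

entities = [(0, 9, "ORG"), (22, 28, "LOC")]
tags = biluo_tags_from_offsets(doc, entities)
print(list(zip([t.text for t in doc], tags)))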


Hello Ines,

Thanks for your prompt response, and pardon me for asking a bit more for further clarification.
The NER flowchart absolutely helps massively by giving us a holistic view of how to approach our NER training. I'd really appreciate it if more resources like this are coming.

In the case of ner.teach, if we include a pre-trained model, say, en_core_web_lg, and run the command

prodigy ner.teach drug_ner en_core_web_lg <training-dataset> --label DRUG,LOCATION --patterns <drug_patterns.jsonl>

Does it tell the model to identify DRUG in the training text based on the patterns we provide, while it simultaneously identifies LOCATION for us based on what the pre-trained model has learnt?

So if we'd like to add a new entity while keeping the existing ones, we would end up with a dataset that includes all entities labeled, ready for use in ner.batch-train.
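For reference, here's roughly how I imagine generating the drug_patterns.jsonl file – as far as I understand, each line is a JSON object with a label and either a token pattern or an exact string (all entries here are made up):

import json

# made-up entries: one token-based pattern and one exact-string pattern
patterns = [
    {"label": "DRUG", "pattern": [{"lower": "aspirin"}]},
    {"label": "DRUG", "pattern": "布洛芬"},  # "ibuprofen" as an exact string
]

with open("drug_patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")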

Secondly, for the use of ner.silver-to-gold and ner.make-gold, I assume these are best applied after we acquire the semi-automatically annotated dataset from model suggestions, so we can refine it with manual inspection. That way it's just more efficient than annotating with ner.manual from scratch.

Thirdly, I am not sure whether spaCy's matcher is applicable to Chinese words at the moment, or whether I should normally just stick to orth for pattern matching.

As for training our custom word vectors via gensim and applying them in spaCy:
From these great threads:

I guess if we import our trained vectors via

python -m spacy init-model zh <my_zh_model> --vectors-loc xxx.word2vec.txt.gz

without giving the --jsonl-loc <vocabulary file>, would spaCy still update the vectors of these tokens if they exist in the base model? However, if we'd like to provide the vocabulary file, should we fill in all attributes of the entry structure [https://spacy.io/api/annotation#vocab-jsonl], or can we leave some unfilled?

Lastly, I have read through a couple of threads on how to better design NER labels. For instance, in the case you brought up in the DataCamp lesson, you say that CLOTHES is better than asking the model to identify SHIRTS and PANTS. But I am not very clear on why that would make the model harder to train. Is it because shirt and pant entities appear surrounded by similar tokens, so it's just harder for the model to distinguish the two based on context? But then what about PERSON and LOCATION? These two entity types are proper nouns, and I assume such words appear in quite similar syntactic structures. Why does it work in that case?

Sorry again for bombarding you with all these questions. And thanks again for your huge contribution to the NLP field.