Hello,
In the great demo video on training a new entity type on Reddit comments [https://prodi.gy/docs/video-new-entity-type], I'm curious whether, at the ner.batch-train step, the training process automatically samples and mixes annotations of the existing entity types with the new entity annotations (in both the train and eval sets), so that the model doesn't forget the old entity types. In addition, I'm assuming that the new model also preserves the old entity types, instead of only recognizing the new one, correct?
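For reference, the command I have in mind looks roughly like this (the dataset name, output path, and label are just placeholders):

```
prodigy ner.batch-train my_new_entity_dataset en_core_web_sm --output /tmp/new-entity-model --label NEW_LABEL --eval-split 0.2
```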
In my case, I will need to add multiple named entity types. If binary annotation is the best practice for collecting examples, can I train each new entity type iteratively on the previously saved model? Furthermore, how can I reuse text that has already been annotated, given that each text contains multiple entities? Should I feed in the original text again, or load the saved annotation file for reuse? Would the annotations be presented as a list of tuples for the multiple entities?
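For example, I imagine reloading the saved annotations somewhat like this (a sketch based on the database API in the docs; the dataset name is hypothetical):

```python
from prodigy.components.db import connect

db = connect()  # connects using the settings in prodigy.json
examples = db.get_dataset('my_first_entity_dataset')  # hypothetical dataset name
# each example should be a dict with 'text', 'spans', 'answer', etc.
print(examples[0])
```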
In regard to very specific entity types in a language such as Chinese, is it better to train from a blank model, as I understand it? Is it sufficient for named entity recognition if the blank model can only segment and tokenize words, meaning it's not required to have a model trained with POS tags, a dependency parser, or word vectors?
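For instance, would starting from something like this be enough? (A sketch using the spaCy v2-style API; I'm assuming the blank Chinese pipeline only provides tokenization, via Jieba if installed.)

```python
import spacy

# blank Chinese pipeline: tokenizer only, no tagger/parser/vectors
nlp = spacy.blank('zh')
ner = nlp.create_pipe('ner')   # v2-style pipeline creation
nlp.add_pipe(ner)
ner.add_label('NEW_LABEL')     # placeholder label
```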
As for specifying patterns, I'm not sure whether there are working Matcher cases for Chinese, or whether I should simply use patterns like [{}, {'orth': WORD}] to identify candidates for the entity type.
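In other words, something like this sketch (the word '北京' is just a stand-in for WORD, and I'm assuming the v2-style Matcher API and that Jieba is available for Chinese tokenization):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('zh')
matcher = Matcher(nlp.vocab)
# single-token pattern matching an exact word
matcher.add('NEW_ENTITY', None, [{'ORTH': '北京'}])

doc = nlp('我住在北京')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```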
Finally, I notice that the annotation content and format from Prodigy are quite different from the annotation output I'm used to in spaCy, which looks like {'entities': [(start_char, end_char, label)]}. If I'd like to reuse the annotation samples for updating a spaCy model, or even export them in IOB/BIOES format for use in other scenarios, would that be possible?
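For instance, I imagine converting Prodigy's JSONL output back into spaCy's offset format with something like this (a sketch; 'annotations.jsonl' would be the file exported via prodigy db-out):

```python
import json

train_data = []
with open('annotations.jsonl', encoding='utf8') as f:
    for line in f:
        eg = json.loads(line)
        if eg.get('answer') != 'accept':
            continue  # keep only accepted examples
        entities = [(s['start'], s['end'], s['label'])
                    for s in eg.get('spans', [])]
        train_data.append((eg['text'], {'entities': entities}))
```

And for IOB/BIOES, I guess these offsets could then be passed to something like spaCy's biluo_tags_from_offsets?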
Thanks for any clarifications.