Hello,
I want to build an NER system to extract information from texts. My texts are in German and contain some special cases that I want to add to the tokenizer so they are handled correctly. For example, I have unique identifiers like "F01-RS2:2" that must be recognized as ONE token and classified as IDENTIFIER (after training with Prodigy).
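To make the goal concrete, here is a minimal check of the current (default) tokenization; the exact split may vary by spaCy version:
import spacy

nlp = spacy.blank('de')
# without special cases, the identifier is probably split on '-' and/or ':'
print([t.text for t in nlp('Der Sensor F01-RS2:2 meldet 30 °C.')])
# goal: 'F01-RS2:2' should appear as exactly ONE token here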
I followed these steps:
1- created a blank spaCy model in German
2- added special cases to its tokenizer
3- saved the model to disk:
import spacy

nlp = spacy.load('blank:de')
nlp.tokenizer.add_special_case('°C', [{"ORTH": '°C'}])
nlp.tokenizer.add_special_case('Xdd-XXd:d', [{"SHAPE": 'Xdd-XXd:d'}])
nlp.to_disk('./my-model')
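As far as I understand, add_special_case only matches the exact registered string, and SHAPE is not a valid attribute there, so the second call probably does not generalize to identifiers like F01-RS2:2. A regex-based token_match might be the better tool; a minimal sketch (the regex is my assumption about the identifier format):
import re
import spacy

nlp = spacy.load('blank:de')
# assumed format: letter, 2 digits, '-', 2 letters, digit, ':', digit
ident_re = re.compile(r'[A-Z]\d{2}-[A-Z]{2}\d:\d$')
default_match = nlp.tokenizer.token_match

def ident_token_match(text):
    # keep whole identifiers as one token, otherwise fall back to the default
    return ident_re.match(text) or (default_match(text) if default_match else None)

nlp.tokenizer.token_match = ident_token_match
print([t.text for t in nlp('F01-RS2:2 ist aktiv.')])  # 'F01-RS2:2' stays one token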
After that, I changed the model name in ./my-model/meta.json to "name": "extended_de_model".
4- I packaged the custom model using spacy package in the terminal:
spacy package ./my-model/ ./model_packages
# then
pip install de_extended_de_model
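At this point a sanity check like the following should pass (at least that is my expectation):
import spacy

nlp = spacy.load('de_extended_de_model')
print(nlp.lang)  # expected: 'de'
print([t.text for t in nlp('30 °C')])  # expected: ['30', '°C'] because of the special case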
5- I started Prodigy with:
prodigy ner.manual all_annotation3 de_extended_de_model ./input_prodigy.jsonl --label CLASS,UI,VALUE --patterns ./patterns.jsonl
and annotated all texts...
6- I trained the model with:
prodigy train --ner all_annotation3 tmp/trained-model --eval-split 0.25
FIRST PROBLEM:
The saved models model-last and model-best are English models! I checked meta.json and it contains "lang": "en".
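This is easy to reproduce by loading the trained model directly:
import spacy

nlp_best = spacy.load('tmp/trained-model/model-best')
print(nlp_best.lang)          # prints 'en' instead of 'de'
print(nlp_best.meta['lang'])  # same value as in meta.json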
SECOND PROBLEM:
The tokenizer of that model is not the one I created in steps 3 and 4. I tried to add the ner component from model-best to de_extended_de_model via add_pipe, but then the labeling fails as well.
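Roughly what I tried, sourcing the trained component into my model (spaCy v3 style):
import spacy

nlp = spacy.load('de_extended_de_model')
nlp_best = spacy.load('tmp/trained-model/model-best')
nlp.add_pipe('ner', source=nlp_best)  # take the trained ner, keep my tokenizer
doc = nlp('Der Sensor F01-RS2:2 meldet 30 °C.')
print([(ent.text, ent.label_) for ent in doc.ents])  # the labels come out wrong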
MY GOAL:
In the end I want one unified model that tokenizes the way I predefined (using the special cases) and then labels the tokens correctly, because I want to do further work with rule-based pattern matching, dependency parsing/matching, and so on.
Note: I also tried taking a fresh model like de_core_news_lg and adding the ner from model-best via add_pipe, but it recognizes the entities incorrectly...
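For context, this is the kind of downstream rule-based step the unified model is meant for (a sketch using the labels from step 5):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('de_extended_de_model')  # ideally the unified model
matcher = Matcher(nlp.vocab)
# a UI entity, anything in between, then a VALUE entity
matcher.add('UI_VALUE', [[{'ENT_TYPE': 'UI'}, {'OP': '*'}, {'ENT_TYPE': 'VALUE'}]])
doc = nlp('F01-RS2:2 hat den Wert 30 °C.')
for _, start, end in matcher(doc):
    print(doc[start:end].text)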
Thanks