I am currently writing my master's thesis on Named Entity Recognition, and I am using Prodigy to see if I can improve the results obtained from spaCy. My dataset contains 50 reports from different financial institutions covering the last 6 years. Each document has around 580 pages on average. I have annotated 5 of the reports in the dataset using the ner.teach recipe.
I have divided the annotations into different datasets so I can observe the effect of the number of annotations in each experiment. In the first experiment I used at most 100 annotations per label, in the second 200 per label, and so on up to the antepenultimate experiment. In the penultimate experiment I trained a model with word vectors (en_core_web_lg). In the last one I trained a model with both word vectors and pretrained tok2vec weights.
To test these models, I selected a small portion of one of the reports, which contains around 100 entities. As a baseline I ran the spaCy model without Prodigy (without any annotations) and got 61 entities back. The models trained with the annotations perform worse than the base spaCy model: precision gets better, but recall gets worse, which means I lose too many entities. I would appreciate any help improving the results, or suggestions about what I might be doing wrong.
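To make the tradeoff I'm describing concrete, here is a minimal sketch of exact-match precision and recall over entity spans; the spans and labels are invented for illustration, not from my actual data:

```python
# Toy example: exact-match precision/recall over entity spans.
# Spans are (start, end, label) tuples; all values here are made up.
gold = {(0, 5, "ORG"), (10, 18, "MONEY"), (25, 30, "ORG"), (40, 44, "DATE")}
pred = {(0, 5, "ORG"), (10, 18, "MONEY")}  # fewer predictions, but all correct

tp = len(gold & pred)          # true positives: spans in both sets
precision = tp / len(pred)     # correct / predicted
recall = tp / len(gold)        # correct / gold

print(f"precision={precision:.2f} recall={recall:.2f}")
# → precision=1.00 recall=0.50
```

This is the pattern I'm seeing after training: the model predicts fewer entities, so almost everything it returns is correct (precision up), but it misses many of the gold entities (recall down).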
Thanks in advance for any help you can provide.