I’m trying to train spaCy to recognize people’s names in Portuguese-language legal documents. I have a file (paragrafos.txt) with about 2,000 paragraphs from legal contracts. I started with the standard Portuguese model and have been training from there. I annotated using:
prodigy ner.teach ner_nome pt_core_news_sm paragrafos.txt --label PER
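For context, here’s roughly how I sanity-check what the base model predicts on my data before any training (a minimal sketch; it just prints the PER spans for the first paragraph of paragrafos.txt):

import spacy

# Load the stock Portuguese model and run it on one paragraph,
# printing only the PER entities it finds.
nlp = spacy.load("pt_core_news_sm")
with open("paragrafos.txt", encoding="utf-8") as f:
    paragraph = f.readline().strip()
for ent in nlp(paragraph).ents:
    if ent.label_ == "PER":
        print(ent.text, ent.start_char, ent.end_char)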
After about 6,000 annotations, I ran batch-train and still get only 37% accuracy!
prodigy ner.batch-train ner_nome pt_core_news_sm --output /model --eval-split 0.5 --label PER --batch-size 2
This is the result:
> Using 1 labels: PER
>
> Loaded model pt_core_news_sm
> Using 50% of accept/reject examples (1217) for evaluation
> Using 100% of remaining examples (1516) for training
> Dropout: 0.2 Batch size: 2 Iterations: 10
>
>
> BEFORE 0.061
> Correct 40
> Incorrect 617
> Entities 5345
> Unknown 530
>
>
> # LOSS RIGHT WRONG ENTS SKIP ACCURACY
> 01 7881.630 86 571 32921 0 0.131
> 02 5894.873 205 452 38230 0 0.312
> 03 5179.047 121 536 32562 0 0.184
> 04 4532.524 196 461 34669 0 0.298
> 05 4177.853 207 450 30510 0 0.315
> 06 3960.036 224 433 30300 0 0.341
> 07 3740.007 222 435 30841 0 0.338
> 08 4074.908 239 418 34695 0 0.364
> 09 3990.998 238 419 35026 0 0.362
> 10 3866.311 244 413 31125 0 0.371
>
> Correct 244
> Incorrect 413
> Baseline 0.061
> Accuracy 0.371
>
> Model: C:\model
> Training data: C:\model\training.jsonl
> Evaluation data: C:\model\evaluation.jsonl
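For what it’s worth, this is how I’ve been spot-checking the exported model (a quick sketch; the contract sentence is just a made-up example):

import spacy

# Load the model that batch-train wrote to --output and inspect its predictions.
nlp = spacy.load(r"C:\model")
doc = nlp("O presente contrato é celebrado entre João da Silva e Maria Oliveira.")
print([(ent.text, ent.label_) for ent in doc.ents])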
When I annotate, Prodigy still suggests things like periods, parentheses, and numbers as names! Am I doing something wrong? What can I do to improve my results?
As a separate question, I annotated in several sessions, and it seems that each session starts from the beginning of the paragrafos.txt file again, since I keep seeing the same sentences over and over. Do I have to annotate everything in a single session?
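Would excluding the examples already in my dataset help here? Based on my reading of the docs on the --exclude option, I was thinking of something like this (so treat the exact usage as my guess):

prodigy ner.teach ner_nome pt_core_news_sm paragrafos.txt --label PER --exclude ner_nome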
Thank you!