Annotation score drops

I trained the model many times with many datasets in Prodigy, and after I added the last annotated dataset the score dropped by more than 20%. Could you please help me figure out where the problem is?

python3 -m prodigy spans.manual CV_DESIGV2_200 blank:en ./200_cv_Json_format.jsonl --label Designation,CompaniesworkedAt,Location,Desig_Experience --patterns patterns_desg.jsonl

python3 -m prodigy train ./CV_DESIGTV2pt1_2000 --ner CV_DESIGV1,CV_DESIGV2_200,CV_DESIGV3_300,CV_DESIGV4_800,CV_DESIGV2_2000 --base-model en_core_web_lg

hi @Mohammad!

Thanks for your question.

Unfortunately, it's really hard to say without a lot more context, for example:

  • what's your total sample size? Adding 20% to 100 annotations is only 20 more examples and won't do much, while adding 20% to 10,000 annotations we'd expect to give much more robust results. From a glance, it looks like you may have a good number of annotations, so this may not be the sole reason.
  • were your new annotations inconsistent? (e.g., did you add new annotators, and if so, did you give them clear annotation guidelines, or could they have added noise to their annotations?)
  • related, how good is the quality of all of your annotations? Did you do any "gold" / second-round reviews of your initial annotations? (This is what the spans.correct recipe is for.)
  • I see you labeled with spans.manual yet are training with --ner -- why is that? Be careful: if you add overlapping spans with spans.manual, you won't be able to train with --ner. This may not be the cause, but it's worth knowing to avoid problems in your workflow.
  • did you add any new labels? Did you have any low-count labels? (e.g., say you had a span type with only 5 examples in your original run and got 100% F1, but after adding 5 more it drops to 50%, simply due to the small sample size)
  • have you run any span characteristics diagnostics? For example, see spacy debug data, including the note on "span characteristics". Overly long spans and/or hard-to-define span boundaries can hurt performance.

Likely the biggest problem I see is that you're not using a dedicated evaluation set. We mention this in the prodigy train docs:

For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. --ner dataset,eval:eval_dataset . If no evaluation sets are specified, the --eval-split is used to determine the percentage held back for evaluation.

If you don't specify a dedicated evaluation dataset, then each time you rerun prodigy train it will reshuffle and re-split the data for evaluation. This can make it look like performance changed, when really what changed is how you're evaluating.
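For example, a minimal sketch (train_dataset and eval_dataset are hypothetical names; eval_dataset would be a separate, dedicated evaluation dataset):

python3 -m prodigy train ./output_dir --ner train_dataset,eval:eval_dataset --base-model en_core_web_lg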

Here's a post where I go through some of that workflow:

Alternatively, you may want to consider switching to spacy train instead of prodigy train via the data-to-spacy recipe. This has many advantages as it'll create the dedicated evaluation sets (as spaCy binary files) and a config that you can modify if you want to do hyperparameter tuning down the road:
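As a rough sketch (hypothetical paths and dataset names), that workflow would look something like:

python3 -m prodigy data-to-spacy ./corpus --ner train_dataset,eval:eval_dataset --base-model en_core_web_lg
python3 -m spacy train ./corpus/config.cfg --output ./output_dir --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

The first command exports your annotations as spaCy binary files plus a config into ./corpus; the second trains directly with spaCy against that fixed split.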

Have you tried train-curve? This recipe was developed to help debug "when to stop" annotating. It, too, works best with a dedicated evaluation dataset.
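For example (same hypothetical dataset names as above):

python3 -m prodigy train-curve --ner train_dataset,eval:eval_dataset --n-samples 4

This trains on 25%, 50%, 75%, and 100% of the training data; if the score is still improving in the last segment, collecting more annotations is likely to help.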

As I also mentioned, if you run data-to-spacy, running spacy debug data will give you span characteristics. If it helps, the spaCy team has a great template project on evaluating spancat:
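As a rough sketch, assuming you exported your data with data-to-spacy into a hypothetical ./corpus directory as above:

python3 -m spacy debug data ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --verbose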

I tried to run span.correct and I got an error. Also, I have not done any gold review.

✘ Can't find recipe or command 'span.correct'.
Run prodigy --help to see available options. If you're using a custom recipe,
provide the path to the Python file using the -F argument.

Sorry - typo on my end. It is spans.correct, not span.correct. See the docs.
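For reference, the call looks roughly like this (hypothetical dataset name and model path; note that spans.correct expects a pipeline with a trained spancat component, so if you trained with --ner you'd use ner.correct the same way):

python3 -m prodigy spans.correct CV_DESIGV2_review ./output_dir/model-best ./200_cv_Json_format.jsonl --label Designation,CompaniesworkedAt,Location,Desig_Experience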

Same problem: the score does not go above 36% on the AWS server, while another server on VMware reaches 68%. Also, on VMware I am not using the latest Prodigy version. What could be the issue?

Is there any OS or hardware requirement for AWS?

All annotations were done on AWS EC2 using Ubuntu, and the score does not reach more than 36%.
Three servers have the same issue on AWS EC2.

Do you offer any direct support? The problem is very critical and only happens on EC2 servers. We reinstalled the OS three times and got the same problem. Any data annotated on EC2 gives a very low score, but the data that was annotated on VMware gives a good score. I uninstalled Prodigy and spaCy and installed an older version, and the problem is still the same.

Kindly help, as our whole project has now stopped and we cannot find any documentation about this online. If you can provide direct support, please let me know.

hi @Mohammad,

Sorry you're having issues.

Just curious - did you try to create the dedicated holdout dataset as I mentioned before? I mention it because if you don't, it can give the illusion that there are differences in training when there really aren't (the only thing changing is how you're evaluating the model).

Can you provide a reproducible example so that we can debug the problem?

What we'd need:

  • SpaCy / Prodigy versions: spacy info and prodigy stats
  • Sample of training data. Perhaps this could be 5-10 records (see the db-out sketch below for one way to export this).
  • spaCy and/or Prodigy commands you're running, including spaCy config files.
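If it helps, one way to pull a small sample out of the Prodigy database (your_dataset is a hypothetical name):

python3 -m prodigy db-out your_dataset > your_dataset.jsonl
head -n 10 your_dataset.jsonl > your_dataset_sample.jsonl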

Version 1.11.12
Location /usr/local/lib/python3.8/dist-packages/prodigy
Prodigy Home /root/.prodigy
Platform Linux-5.4.0-137-generic-x86_64-with-glibc2.29
Python Version 3.8.10
Database Name SQLite
Database Id sqlite
Total Datasets 2
Total Sessions 3

spaCy version 3.4.0
Location /usr/local/lib/python3.8/dist-packages/spacy
Platform Linux-5.4.0-137-generic-x86_64-with-glibc2.29
Python version 3.8.10
Pipelines en_core_web_sm (3.4.1)

Thank you! What commands were you running?

I have not used debug data before, and I did not create any config.cfg. I am using the default configuration.

For training I am using the command below:
python3 -m prodigy train --ner EDUCV1_50

and when I run the review recipe, the annotations are almost all correct.

We are using VNC and RDP to access the remote servers where Prodigy is installed. Could the issue be that the label positions are not saved correctly, or that the screen resolution causes problems with the label positions?

For the EC2 vs VMware: do they have different operating systems?

I saw one of the OS you posted was: Platform Linux-5.4.0-137-generic-x86_64-with-glibc2.29. Was that EC2 or VMware? Can you provide the same Python/spaCy/Prodigy versions?

They both have Ubuntu, but different versions. On VMware we have:
Ubuntu 20.04.5 LTS with 8 vCPU and 32 GB RAM

============================== ✨ Prodigy Stats ==============================

Version 1.11.8
Location /usr/local/lib/python3.8/dist-packages/prodigy
Prodigy Home /root/.prodigy
Platform Linux-5.15.0-69-generic-x86_64-with-glibc2.29
Python Version 3.8.10
Database Name SQLite
Database Id sqlite
Total Datasets 11
Total Sessions 51

root@AIServer:~# python3 -m spacy info

============================== Info about spaCy ==============================

spaCy version 3.4.3
Location /usr/local/lib/python3.8/dist-packages/spacy
Platform Linux-5.15.0-69-generic-x86_64-with-glibc2.29
Python version 3.8.10
Pipelines en_core_web_sm (3.4.1), en_core_web_lg (3.4.1), en_core_web_md (3.4.1)

hi @Mohammad,

I discussed with a few teammates, and we did find that different OS's can produce a different random seed, which causes a different train/eval partitioning across OS's. We're putting in a fix so that the seed is the same across all OS's.

However, generally we find that reshuffling the partitions shouldn't have a major effect on performance unless you have small samples or there's a lot of noise or inconsistency in your annotations.

Our suggestion is still the same: to prevent issues, you should consider using a dedicated holdout dataset. For example, separate your data into two datasets, train_data and eval_data, and then run prodigy train --ner train_data,eval:eval_data ... so that your holdout is consistent.
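A minimal sketch of one way to create that split (hypothetical dataset and file names, roughly 80/20):

python3 -m prodigy db-out your_dataset > all_annotations.jsonl
shuf all_annotations.jsonl > shuffled.jsonl
head -n 40 shuffled.jsonl > train_data.jsonl
tail -n +41 shuffled.jsonl > eval_data.jsonl
python3 -m prodigy db-in train_data train_data.jsonl
python3 -m prodigy db-in eval_data eval_data.jsonl
python3 -m prodigy train ./output_dir --ner train_data,eval:eval_data

The head/tail line counts here are just placeholders; adjust them to roughly 80%/20% of your total number of annotated examples. The important part is that eval_data stays fixed across training runs.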

If you are seeing major variations due to partitioning, that may suggest noise or inconsistency in your annotations. This should be a warning that in production, your model may not perform as well as you expect.

Hope this helps!

I ran the command prodigy train --ner train_data,eval:eval_data and the score is still low. I changed to GoDaddy server hosting with the same setup, and the score now reaches 51% and stops increasing. I added extra data and the score moves very slowly.

But on VMware, if I do the annotation there and run the training, the score crosses 78%.
Are there any OS requirements? As far as I can see, if I do the annotation on different server types I get different scores.

No. To my knowledge, the OS won't affect the act of annotating (i.e., making the annotations).

If you want to test this, label 10 train records and 10 eval records in each environment. This way you have four .jsonl files (2 train, 2 eval, one pair from each environment). Run training in both environments. Then export those .jsonl files and import them into the other environment (e.g., using db-in), and run training again. What are the results?
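As a sketch (hypothetical dataset/file names), moving a dataset between environments looks like:

python3 -m prodigy db-out train_vnc > train_vnc.jsonl
# copy train_vnc.jsonl to the other machine, e.g. with scp, then:
python3 -m prodigy db-in train_vnc train_vnc.jsonl
python3 -m prodigy train ./output_dir --ner train_vnc,eval:eval_vnc

If the scores then match across environments on identical data, the difference should be in the data or the split, not the OS.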

I ran annotation on the remote server with the same data using RDP and VNC, and got different scores.

1 - Using RDP:
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE


0 0 0.00 252.38 0.00 0.00 0.00 0.00
5 200 4044.88 7001.15 18.67 31.82 13.21 0.19
10 400 1550.07 810.52 18.67 31.82 13.21 0.19
15 600 1112.31 401.57 19.44 36.84 13.21 0.19
20 800 3586.26 243.60 24.62 66.67 15.09 0.25
25 1000 1464.65 161.26 25.00 47.37 16.98 0.25
30 1200 183.65 79.56 21.21 53.85 13.21 0.21
35 1400 62.24 23.93 17.39 37.50 11.32 0.17
40 1600 90.74 25.76 22.22 42.11 15.09 0.22
45 1800 268.91 67.89 21.54 58.33 13.21 0.22
50 2000 254.14 69.72 26.32 43.48 18.87 0.26
55 2200 155.94 39.29 16.44 30.00 11.32 0.16
60 2400 218.54 48.10 11.59 25.00 7.55 0.12
65 2600 5906.46 119.43 21.21 53.85 13.21 0.21
70 2800 198.74 43.29 17.65 40.00 11.32 0.18
75 3000 1163.07 60.53 20.90 50.00 13.21 0.21
80 3200 54435.52 123.18 11.59 25.00 7.55 0.12
85 3400 194.91 34.18 17.39 37.50 11.32 0.17
90 3600 487.63 58.87 25.00 47.37 16.98 0.25

2 - Using VNC:
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE


0 0 0.00 378.08 0.00 0.00 0.00 0.00
5 200 5122.72 8761.09 11.11 18.75 7.89 0.11
11 400 901.47 1058.81 17.02 44.44 10.53 0.17
16 600 6372.11 679.45 15.69 30.77 10.53 0.16
22 800 119.18 127.62 14.81 25.00 10.53 0.15
27 1000 156.47 85.57 15.69 30.77 10.53 0.16
33 1200 144.94 61.29 21.82 35.29 15.79 0.22
38 1400 109.22 43.75 25.71 28.12 23.68 0.26
44 1600 1505.29 90.05 12.50 30.00 7.89 0.12
50 1800 297.38 51.94 19.23 35.71 13.16 0.19
55 2000 39726.22 233.98 26.92 50.00 18.42 0.27
61 2200 2278.38 107.04 21.43 33.33 15.79 0.21
66 2400 4786.93 89.19 37.04 62.50 26.32 0.37
72 2600 106.03 27.39 28.57 36.00 23.68 0.29
77 2800 115.66 29.22 23.33 31.82 18.42 0.23
83 3000 535.06 48.82 26.42 46.67 18.42 0.26
88 3200 1587.46 78.36 22.22 37.50 15.79 0.22
94 3400 276.66 64.37 21.82 35.29 15.79 0.22
100 3600 540.11 49.03 24.56 36.84 18.42 0.25
105 3800 179.43 35.59 15.38 28.57 10.53 0.15
111 4000 199.21 33.09 24.14 35.00 18.42 0.24

Can you provide the .jsonl files, the OS (what does RDP and VNC mean?), and the exact (prodigy train) commands you ran?

I am using VNC to connect to the remote server desktop, to access the remote Ubuntu server where Prodigy is installed. Here is the data annotated on the remote server over the VNC desktop:
EDUCV1_50.jsonl (1.9 MB)

And here is the data annotated while using Windows Remote Desktop (RDP) to access the remote Ubuntu server where Prodigy is installed:

EDUCV1_50_rdp.jsonl (2.1 MB)

Thanks -- can you provide the four .jsonl files? Two train / two eval, one pair from each environment. Also, please provide the exact commands you're using to train.

eval_data_EDUCV1_50_vnc.jsonl (536.4 KB)
eval_data_EDUCV1_50_vnc.jsonl (536.4 KB)
train_data_EDUCV1_50_vnc.jsonl (1.9 MB)
train_data_rdp.jsonl (1.9 MB)

Please find the attached files.