I trained the model many times with many datasets on Prodigy, and after I added the last annotated dataset the score dropped by more than 20%. Could you please help me figure out where the problem is?
Unfortunately, it's really hard to say without a lot more context, for example:
what's your total sample size? Adding 20% more on top of 100 annotations won't do much, while adding 20% on top of 10,000 annotations should give more robust results. From a glance, it looks like you may have a good number of annotations, so this may not be the sole reason.
were your new annotations inconsistent? For example, did you add new annotators, and if so, did you provide them with clear annotation guidelines, or could they have added noise to the annotations?
related, how good is the quality of all of your annotations? Did you do any "gold" / second-round reviews of your initial annotations? (This is what the spans.correct recipe is for.)
I see you labeled with spans.manual yet are training with --ner -- why is that? Be careful: if you add overlapping entities with spans.manual, you won't be able to train with --ner. This may not be the cause, but it's worth knowing to avoid problems in your workflow (see the sketch after this list).
did you add any new labels? Did you have any labels with very few records? For example, say you had a span type with only 5 examples in your original run and got 100% F1, but then you added 5 more and now it's 50%, simply due to sample size.
have you run any span characteristics diagnostics? For example, see spacy debug data, including the note on "span characteristics". Spans that are too long and/or poorly defined span boundaries can hurt performance.
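As a minimal sketch of the last two points (the dataset names, model path, and labels below are placeholders, not your actual ones), a second-pass review and a spancat training run could look roughly like this:

# second-pass review of span annotations; spans.correct needs a pipeline with a
# trained spancat component, so ./my_spancat_model stands in for your own model
python3 -m prodigy spans.correct my_reviewed_dataset ./my_spancat_model ./examples.jsonl --label SKILL,DEGREE
# if your spans can overlap, train the span categorizer instead of NER
python3 -m prodigy train --spancat my_reviewed_dataset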
Likely the biggest problem I see is that you're not using dedicated evaluation sets. We mention this in the prodigy train docs:
For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. --ner dataset,eval:eval_dataset. If no evaluation sets are specified, the --eval-split is used to determine the percentage held back for evaluation.
If you don't specify a dedicated evaluation dataset, each time you rerun prodigy train, it'll reshuffle and re-split the data used for evaluation. This can make it seem like the performance changed, when really what's changing is how you're evaluating.
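For example (dataset names here are placeholders), compare a run with a fixed evaluation set to one that relies on the random split:

# dedicated, fixed evaluation set
python3 -m prodigy train --ner my_train_dataset,eval:my_eval_dataset
# no eval: prefix, so a percentage of the data (default 0.2) is held back and reshuffled on each run
python3 -m prodigy train --ner my_train_dataset --eval-split 0.2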
Here's a post where I go through some of that workflow:
Alternatively, you may want to consider switching to spacy train instead of prodigy train via the data-to-spacy recipe. This has many advantages as it'll create the dedicated evaluation sets (as spaCy binary files) and a config that you can modify if you want to do hyperparameter tuning down the road:
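A rough sketch of that workflow, assuming an NER dataset and an output folder called ./corpus (both names are placeholders):

# export annotations to spaCy's binary format plus a training config
python3 -m prodigy data-to-spacy ./corpus --ner my_train_dataset,eval:my_eval_dataset
# train with spaCy directly, using the generated config and corpora
python3 -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --output ./output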
Have you tried train-curve? This recipe was developed to help debug the question of "when to stop" annotating. It too works best with a dedicated evaluation dataset.
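For instance (again with placeholder dataset names), something like:

# trains on 25%, 50%, 75% and 100% of the data to show whether more annotations still help
python3 -m prodigy train-curve --ner my_train_dataset,eval:my_eval_dataset --n-samples 4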
As I also mentioned, if you run data-to-spacy, running spacy debug data will report span characteristics. If it helps, the spaCy team has a great template project on evaluating spancat:
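If you've run data-to-spacy into ./corpus as sketched above, the diagnostic would look roughly like:

# reports label counts and, for spancat pipelines, span characteristics (length, boundary distinctiveness)
python3 -m spacy debug data ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy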
I tried to run span.correct and I got an error. Also, I have not done any gold review.
✘ Can't find recipe or command 'span.correct'.
Run prodigy --help to see available options. If you're using a custom recipe,
provide the path to the Python file using the -F argument.
Same problem: the score does not grow beyond 36% on the AWS server, while another server on VMware reaches 68%. Also, on VMware I am not using the latest Prodigy version. What could be the issue?
Is there any OS or hardware requirement on AWS?
All annotations were done on AWS EC2 using Ubuntu, and the score does not reach more than 36%.
Three servers on AWS EC2 have the same issue.
Do you offer any direct support? The problem is very critical and only happens on EC2 servers. We reinstalled the OS three times and have the same problem: any data annotated on EC2 gives a very low score, but the data that was annotated on VMware gives a good score. I uninstalled Prodigy and spaCy and installed an older version, and still the same problem.
Kindly please help, as our whole project is now stopped and we cannot find any documentation on this online. If you can provide direct support, please let me know.
Just curious - did you try to create the dedicated holdout dataset as I mentioned before? I mention this because if you don't, it can give the illusion that there are differences in training when there really aren't (what's changing is only how you're evaluating the model).
Can you provide a reproducible example so that we can debug the problem?
What we'd need:
SpaCy / Prodigy versions: spacy info and prodigy stats
Sample of training data. Perhaps this could be 5-10 records.
spaCy and/or Prodigy commands you're running, including spaCy config files.
Version 1.11.12
Location /usr/local/lib/python3.8/dist-packages/prodigy
Prodigy Home /root/.prodigy
Platform Linux-5.4.0-137-generic-x86_64-with-glibc2.29
Python Version 3.8.10
Database Name SQLite
Database Id sqlite
Total Datasets 2
Total Sessions 3
spaCy version 3.4.0
Location /usr/local/lib/python3.8/dist-packages/spacy
Platform Linux-5.4.0-137-generic-x86_64-with-glibc2.29
Python version 3.8.10
Pipelines en_core_web_sm (3.4.1)
I have not used debug data before, and I did not create any config.cfg. I am using the default configuration.
For training I am using the command below:
python3 -m prodigy train --ner EDUCV1_50
and I ran the review recipe; the annotations are almost all correct.
We are using VNC and RDP to access the remote servers where Prodigy is installed. Could the issue be that label positions are not saved correctly, or that the screen resolution causes problems with the label positions?
For the EC2 vs VMware: do they have different operating systems?
I saw one of the OS you posted was: Platform Linux-5.4.0-137-generic-x86_64-with-glibc2.29. Was that EC2 or VMware? Can you provide the same Python/spaCy/Prodigy versions?
Version 1.11.8
Location /usr/local/lib/python3.8/dist-packages/prodigy
Prodigy Home /root/.prodigy
Platform Linux-5.15.0-69-generic-x86_64-with-glibc2.29
Python Version 3.8.10
Database Name SQLite
Database Id sqlite
Total Datasets 11
Total Sessions 51
root@AIServer:~# python3 -m spacy info
============================== Info about spaCy ==============================
spaCy version 3.4.3
Location /usr/local/lib/python3.8/dist-packages/spacy
Platform Linux-5.15.0-69-generic-x86_64-with-glibc2.29
Python version 3.8.10
Pipelines en_core_web_sm (3.4.1), en_core_web_lg (3.4.1), en_core_web_md (3.4.1)
I discussed this with a few teammates, and we did find that on different OS's the random seed can differ, which causes a different train/eval partitioning across OS's. We're putting in a fix so that the seed is the same across all OS's.
However, generally, we find that reshuffling the partitions shouldn't have a major effect on performance unless you have small samples or there's a lot of noise or inconsistency in your annotations.
Our suggestion is still the same: to prevent issues, you should consider using a dedicated holdout dataset. For example, separate your data into two datasets, train_data and eval_data, and then run prodigy train --ner train_data,eval:eval_data ... so that your holdout is consistent.
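As a rough sketch of one way to create that split (dataset and file names are placeholders; shuf is a GNU coreutils command):

python3 -m prodigy db-out my_annotations > all.jsonl
shuf all.jsonl > shuffled.jsonl                        # shuffle once so the split isn't biased by annotation order
split_at=$(( $(wc -l < shuffled.jsonl) * 80 / 100 ))   # roughly an 80/20 split
head -n "$split_at" shuffled.jsonl > train.jsonl
tail -n +"$(( split_at + 1 ))" shuffled.jsonl > eval.jsonl
python3 -m prodigy db-in train_data train.jsonl
python3 -m prodigy db-in eval_data eval.jsonl
python3 -m prodigy train --ner train_data,eval:eval_data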
If you are seeing major variations due to partitioning, that may suggest noise or inconsistency in your annotations. This should be a bit of a warning that in production, your model may not perform as well as you expect.
I ran the command prodigy train --ner train_data,eval:eval_data and the score was still low. I changed to GoDaddy server hosting with the same setup, and the score now reaches 51% and stops increasing; I added extra data and the score moves very slowly.
But on VMware, if I do the annotation there and run the training, the score crosses 78%.
Are there any OS requirements? As far as I can see, if I do the annotation on different server types I get different scores.
No. To my knowledge, the OS won't affect the act of annotating (i.e., making the annotations).
If you want to test, label 10 train records and 10 eval records in each environment. This way you have four .jsonl files (2 train, 2 eval, each from the two different environments). Run training in both environments. Then export those .jsonl files and import them into the other environment (e.g., using db-out and db-in, as sketched below), and run training again. What are the results?
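A minimal sketch of that export/import step (dataset and file names are placeholders):

# on machine A: export the annotations made there
python3 -m prodigy db-out train_a > train_a.jsonl
python3 -m prodigy db-out eval_a > eval_a.jsonl
# copy the .jsonl files to machine B, then import them as new datasets
python3 -m prodigy db-in train_a_imported train_a.jsonl
python3 -m prodigy db-in eval_a_imported eval_a.jsonl
# train on machine B with the imported data and compare the scores
python3 -m prodigy train --ner train_a_imported,eval:eval_a_imported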
I connect via VNC to the remote desktop of the Ubuntu server where Prodigy is installed. Here is the data annotated on that remote server using the VNC desktop: EDUCV1_50.jsonl (1.9 MB)
And here is the data annotated while using Windows Remote Desktop (RDP) to access the remote Ubuntu server where Prodigy is installed:
Thanks -- can you provide the four .jsonl files? Two train / two eval, each from the two separate environments. Also, please provide the exact commands you're using to train.