ner.teach Updates and expected changes in the scores

I'm experimenting with bootstrapping ICT terms in German texts using ner.teach, an existing large German spaCy model, and a set of patterns derived from a relatively large ICT term list (3,000 entries). I'm now observing that spans with the same frequent false-positive mention are proposed with the same score every time (always 0.5, it never changes), even though I have already rejected dozens of items, saved the annotations, and restarted the process. What change in the scores should I expect? When do the updates happen? On saving annotations? Or am I missing something basic?
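For context, the patterns were generated from the term list roughly like this (the label name "ICT", the example terms, and the file name are simplified placeholders):

```python
import json

# Simplified version of how the match patterns were built from the term list.
# The real list has ~3,000 entries; label and file name are placeholders.
terms = ["Cloud-Computing", "Server", "Firewall"]

# One token pattern per whitespace-separated token, matched case-insensitively
patterns = [
    {"label": "ICT", "pattern": [{"lower": tok.lower()} for tok in term.split()]}
    for term in terms
]

with open("ict_patterns.jsonl", "w", encoding="utf8") as f:
    for p in patterns:
        f.write(json.dumps(p) + "\n")
```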
Thanks for any hints.

Hi! The answers are sent back to the server in batches, so it'll typically take about 20-30 annotations before you see a change in example selection, as the next batches of examples scored by the updated model come in. In general, the default configuration uses the prefer_uncertain sorter, which sends out the examples with the most uncertain scores, so if there are examples with an exact 0.5 score, those are the ones you'll be seeing.
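To illustrate the idea, here's a minimal sketch of the sorting logic (not Prodigy's actual implementation): an uncertainty sorter moves examples whose scores are closest to 0.5 to the front of the queue.

```python
# Sketch of an uncertainty sorter: examples whose model scores are
# closest to 0.5 come out first. (Illustration only, not Prodigy's code.)
def prefer_uncertain_sketch(scored_stream):
    # scored_stream: iterable of (score, example) tuples
    ranked = sorted(scored_stream, key=lambda item: abs(item[0] - 0.5))
    return [example for score, example in ranked]

scored = [(0.92, "Cloud-Computing"), (0.50, "Server"), (0.12, "Haus")]
print(prefer_uncertain_sketch(scored))  # ['Server', 'Haus', 'Cloud-Computing']
```

So as long as the model keeps assigning exactly 0.5 to the same mention, that mention will keep winning the queue.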

It's definitely suspicious, though, if there's no change in scores at all and you're only ever seeing 0.5 and not even a 0.49. How many labels do you have in total? And are you starting off with a pretrained model or are you doing a "cold start" with only patterns?

Another thing to keep in mind is that when you restart the server and you already have some annotations, you typically want to pretrain a model artifact with the existing binary annotations, so you can start off with a model that knows at least something about the task. This will essentially be a better version of the model updated in the loop (since you're batch training and can use various tricks like making multiple passes over the data, dropout etc.).
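For example, something along these lines (dataset, model and file names are placeholders; check `prodigy train --help` for the exact flags in your version):

```shell
# Batch-train a model artifact from the binary annotations collected so far.
# All names here are placeholders for your own dataset/model/files.
prodigy train ./ict_model --ner ict_binary_dataset --base-model de_core_news_lg

# Then restart the teach loop with the pretrained artifact
prodigy ner.teach ict_binary_dataset ./ict_model ./texts.jsonl \
    --label ICT --patterns ict_patterns.jsonl
```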

Thanks for your reply. I have only one label in the data, and the score is consistently 0.5. I was doing a cold start with patterns only. I tried using a large German model and also a small one, and I tried both the Prodigy nightly build and the regular one; the behaviour was the same in all cases.
I'll try training a model on the binary annotations with Prodigy.

In the meantime, I was able to train a domain-pretrained transformer-based model with spaCy on a rather small gold standard (68 documents) that was produced outside of Prodigy, and this model performed pretty well (F-score 82). So we're thinking that the ner.teach approach may not be the best way to bootstrap a new entity category. Fully annotating a few dozen carefully selected example documents, training a spaCy transformer model and then investing in ner.correct via Prodigy seems more targeted. (That's also much closer to what you, Ines, recommended in the newer NER bootstrapping videos.) But I liked the old idea of going through the standard bootstrap cycle: terms.teach, ner.teach, train...
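Concretely, the workflow we're now leaning towards looks roughly like this (config, dataset and file names are placeholders):

```shell
# 1. Fully annotate a small set of carefully selected documents
# 2. Train a transformer-based pipeline with spaCy on that gold standard
python -m spacy train config.cfg --output ./ict_transformer

# 3. Let the trained model pre-highlight entities and correct them in Prodigy
prodigy ner.correct ict_gold ./ict_transformer/model-best ./texts.jsonl --label ICT
```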