Understanding high level flow of Prodigy for text classification

Hello guys,
Some background on what I am trying to do:
I am creating a custom recipe for sentence categorization, e.g. “Does the sentence imply the user is blaming their employer?”, and I want to use the new sentence encoding models (InferSent etc.) instead of the spaCy text classification model. I want to source examples from my SQL Server database.
Here is what I am hoping to achieve with prodigy.

  1. Have an initial set of 3 positive and 3 negative example sentences added to the Prodigy dataset manually.
  2. Use InferSent to get sentence vectors for all examples in the dataset so far and train a classification model.
  3. Pull new input examples from my SQL database, score them with my custom model and present them to the user with the prefer_high_scores strategy, as I am expecting very few positive-class examples (maybe 1 in 50).
  4. Have the user tag a certain number of examples, say 5.
  5. Repeat steps 2 to 4.

Here is what I have done so far:

  1. Created a custom recipe based on textcat.teach.
  2. Created an InferSent model and a sentence classification model when Prodigy starts, available to the score and update methods.
  3. Created an input stream to source my examples one at a time from my SQL DB.
  4. Passed my input stream to a custom scoreStream function that yields (score, task), where the score comes from my classification model.
  5. Passed the output of scoreStream to Prodigy's prefer_high_scores function.
  6. Created a custom update function that I specified in the custom recipe's return value. This function extracts all annotations from the dataset and retrains my classification model.
  7. Set ‘batch_size’: 5 in the custom recipe's config.
  8. When I begin, my dataset is empty, ‘exclude’ is set to None, and I am NOT using combine_models, so there is no confusion and I can understand exactly what's going on.
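For concreteness, the pieces above can be sketched in a self-contained way. Everything here (sql_stream, Classifier, prefer_high and the threshold) is a stand-in for the components described in the list, not the real Prodigy, InferSent or SQL Server APIs:

```python
# Self-contained sketch of the intended pipeline: SQL stream -> score
# with a custom model -> keep only high-scoring tasks for annotation.
# All names and scoring logic here are hypothetical stand-ins.

def sql_stream():
    # stand-in for pulling examples from SQL Server one at a time
    for i, text in enumerate(["good day", "bad boss", "fine", "blamed employer"]):
        yield {"id": i, "text": text}

class Classifier:
    def predict(self, text):
        # stand-in for InferSent sentence vectors + a classifier score
        return 0.9 if "boss" in text or "blame" in text else 0.1

    def update(self, answers):
        pass  # retrain on the new batch of annotations here

model = Classifier()

def score_stream(stream):
    # yields (score, task) tuples, like the custom scoreStream function
    for task in stream:
        yield (model.predict(task["text"]), task)

def prefer_high(scored, threshold=0.5):
    # stand-in for Prodigy's prefer_high_scores sorter
    for score, task in scored:
        if score >= threshold:
            yield task

queue = list(prefer_high(score_stream(sql_stream())))
# only the two high-scoring tasks survive the sorter
```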

Here is what I see is happening

  1. When Prodigy starts, it pulls 65 examples from my stream. Why 65?
  2. After I tag the first 10 examples, my custom update function seems to be called. Why 10?
  3. Beyond the first 10 examples, my custom update function is called after every 5 examples I tag, which is in line with the batch_size of 5 I have set.
  4. After every 5 examples I tag, it looks like 8–11 new examples are fetched from my stream. What triggers it to fetch new examples?

So generally, I wanted to know:
What does the overall flow look like?
What is the logic inside prefer_high_scores?
When does Prodigy get additional examples from the source?
If I update my model every 5 annotations, I am assuming that only the new examples will be scored using my new model, and not all the old ones that were rejected by prefer_high_scores, say, 30 annotations earlier. Is that right?

Thanks, those are all good questions! Your workflow and idea sound good – answers below:

I suspect that because you're using the prefer_high_scores sorter, this is how many examples it takes to fill up one full batch (5 examples) for annotation. Keep in mind that the “preference” sorters keep an exponential moving average of the scores to decide whether to show you an example: they skip examples with low scores and yield examples with high scores. So it's possible that it took 65 examples from your stream to produce a batch of 5 examples that met the “high scores” requirement.
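To make the moving-average idea concrete, here's a minimal sketch of how such a sorter could work. This is an illustration of the principle, not Prodigy's actual implementation, and the smoothing constant and starting value are assumptions:

```python
# Sketch of an exponential-moving-average "prefer high scores" sorter.
# An example is shown only if its score beats the running average, so
# a stream of mostly low scores burns through many examples per yield.

def prefer_high_scores_sketch(scored_stream, smoothing=0.9):
    """Yield tasks whose score beats an exponential moving average."""
    ema = 0.5  # assumed starting estimate of the typical score
    for score, task in scored_stream:
        if score >= ema:
            yield task
        # update the moving average whether or not the task was shown
        ema = smoothing * ema + (1 - smoothing) * score

# Mostly low scores: only the two high-scoring tasks get through.
stream = [(0.1, "a"), (0.2, "b"), (0.9, "c"), (0.05, "d"), (0.8, "e")]
picked = list(prefer_high_scores_sketch(iter(stream)))
```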

Annotated examples are sent back to the server in batches. One batch is always kept in the app to allow you to undo a previous decision (so this can be done straight in the app, instead of having the server do all the reconciling). Outgoing annotations are kept in the outbox, until the next buffer is full, and are then sent back to the server. So with a batch size of 5, you’d get the first examples after 10 annotations (once the buffer and history are full), and after that, on every full batch of 5. Once annotations are received, the update function is called.
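The "first update after 10, then every 5" pattern above can be simulated. This is a simplified model of the client-side buffering just described (one batch held back for undo, one batch in the outbox), not the app's real code:

```python
# Simplified simulation of the annotation buffering: the first batch
# fills the in-app history (for undo), and a batch is only sent to the
# server once the outbox also holds a full batch.

def simulate_answers(n_annotations, batch_size=5):
    """Return the annotation counts at which a batch reaches the server."""
    sent_at = []
    history = 0  # examples kept in the app to support undo
    outbox = 0   # answered examples waiting to be sent
    for i in range(1, n_annotations + 1):
        if history < batch_size:
            history += 1  # the first batch never leaves the app
        else:
            outbox += 1
            if outbox == batch_size:
                sent_at.append(i)  # server receives a batch; update() runs
                outbox = 0
    return sent_at
```

With batch_size=5, the first batch arrives after 10 annotations and every 5 thereafter, matching what you observed.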

New examples are fetched whenever the queue in the app is running low. This is calculated dynamically – either when the queue is lower than half the batch size or less than 2 examples, whichever happens first. The idea behind this is that if you annotate super fast, you should normally never hit an empty queue and the buffer should always be filled up in time.
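As a rough sketch, the refill condition amounts to something like the following (the exact thresholds here are an assumption for illustration):

```python
# Sketch of the dynamic refill trigger: fetch more tasks when the
# in-app queue is below half a batch, or below 2 tasks, whichever
# condition is hit first as the queue shrinks.

def needs_refill(queue_len, batch_size=5):
    """True when the app should request more examples from the server."""
    return queue_len < batch_size / 2 or queue_len < 2
```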

Yes, that’s correct. The examples that are skipped by the sorter are ignored and every time new annotations come in, the model is updated with that batch of annotations. If you use the model to score the examples and the model weights change in the meantime, the updated model will be used to score the examples, which will then be sorted by the sorter, and so on.
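The reason updated weights are picked up mid-stream is that scoring happens lazily, at iteration time. Here's a self-contained toy demonstration (the model class is hypothetical; a version counter stands in for real weights):

```python
# Because the scored stream is a generator, each example is scored at
# the moment it is pulled -- so examples requested after an update are
# scored with the new "weights".

class ToyModel:
    def __init__(self):
        self.version = 0  # stands in for the model weights

    def __call__(self, stream):
        for task in stream:
            yield (self.version, task)  # score with the current weights

    def update(self, answers):
        self.version += 1  # "retrain" on a batch of annotations

model = ToyModel()
scored = model(iter(["a", "b", "c", "d"]))

first = next(scored)      # scored before any update
model.update(["answer"])  # weights change mid-stream
second = next(scored)     # scored with the updated model
```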

This explains the following composition, made possible because Prodigy streams are generators:

stream = prefer_uncertain(model(stream))
update = model.update

At least, that’s how the built-in recipes are designed – you can always implement your own logic or a different active learning strategy in a custom recipe.

Btw, in case you haven’t seen it yet, another tip if you want to see what’s going on behind the scenes: You can set the environment variable PRODIGY_LOGGING=basic or PRODIGY_LOGGING=verbose to log everything that’s happening in the individual components. Verbose logging will also log the exact data that’s passed around (so you probably want to log it to a file).