Hello guys,
Some background on what I am trying to do:
I am creating a custom recipe for sentence categorization, e.g. “Does the sentence imply the user is blaming their employer?”, and I want to use the new sentence encoding models (InferSent etc.) instead of the spaCy text categorization model. I want to source examples from my SQL Server database.
Here is what I am hoping to achieve with Prodigy:
1. Have an initial set of 3 positive and 3 negative example sentences added to the Prodigy dataset manually.
2. Use InferSent to get sentence vectors for all examples in the dataset so far and train a classification model.
3. Pull new input examples from my SQL database, score them with my custom model and present them to the user with the prefer_high_scores strategy, as I am expecting very few positive class examples (maybe 1 in 50).
4. Have the user tag a certain number of examples, say 5.
5. Repeat steps 2 to 4.
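To make the intended loop concrete, here is a minimal sketch of it in plain Python. Everything in it is a stand-in: `encode` is a toy keyword counter in place of InferSent, the "classifier" is a nearest-centroid rule rather than my real model, `oracle` plays the role of the human annotator, and sorting the whole pool is only a rough approximation of what a high-score strategy might do.

```python
import math

# Hypothetical stand-in for a sentence encoder such as InferSent:
# maps a sentence to a tiny keyword-count vector.
KEYWORDS = ("employer", "boss", "fault", "blame")

def encode(sentence):
    words = [w.strip(".,!?") for w in sentence.lower().split()]
    return [float(sum(w == k for w in words)) for k in KEYWORDS]

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def train(examples):
    """examples: list of (sentence, label) pairs; both classes must be present."""
    pos = centroid([encode(s) for s, y in examples if y])
    neg = centroid([encode(s) for s, y in examples if not y])
    return pos, neg

def score(model, sentence):
    """Value in (0, 1); higher means closer to the positive centroid."""
    pos, neg = model
    v = encode(sentence)
    return 1.0 / (1.0 + math.exp(math.dist(v, pos) - math.dist(v, neg)))

def oracle(sentence):
    """Stand-in for the human annotator: positive if any keyword occurs."""
    return any(encode(sentence))

def run_rounds(seed, pool, annotate, batch_size=5, rounds=2):
    annotated = list(seed)                       # step 1: manual seed set
    pool = list(pool)
    for _ in range(rounds):                      # step 5: repeat
        model = train(annotated)                 # step 2: retrain on everything so far
        pool.sort(key=lambda s: score(model, s), reverse=True)  # step 3: score
        batch, pool = pool[:batch_size], pool[batch_size:]
        annotated += [(s, annotate(s)) for s in batch]          # step 4: tag a batch
    return annotated, train(annotated)

SEED = [
    ("my employer is to blame", True),
    ("it was the fault of my boss", True),
    ("i blame my boss", True),
    ("the weather is nice", False),
    ("i like my job", False),
    ("lunch was good", False),
]
```

Calling `run_rounds(SEED, pool, oracle)` with a list of unlabeled sentences returns the grown annotation list and the final model. The real setup differs in that the pool is a stream rather than a list, which is where my questions below come from.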
Here is what I have done so far:
1. I have created a custom recipe that's based on textcat.teach.
2. I create an InferSent model and a sentence classification model when Prodigy starts, so that both are available to the score and update methods.
3. I create an input stream that sources my examples one at a time from my SQL DB.
4. I pass my input stream to a custom scoreStream function that yields (score, task), where the score comes from my classification model.
5. I pass the output of scoreStream to Prodigy's prefer_high_scores function.
6. I created a custom updateFunction that I specified in the custom recipe's return value. This function extracts all annotations from the dataset and retrains my classification model.
7. In the customRecipe config I set ‘batch_size’: 5.
8. When I begin, my dataset is empty, I have ‘exclude’ set to None, and I am NOT using combine_models, so there is no confusion and I can understand exactly what's going on.
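Roughly, the pieces above are wired together like the sketch below. It is a simplified illustration, not my actual recipe: `Model` is a toy stand-in for the InferSent encoder plus classifier, `sqlite3` stands in for SQL Server, the table and column names are made up, and the update callback keeps answers in memory instead of reading them back from the dataset. The Prodigy-specific parts (the recipe decorator and the `prefer_high_scores` call) are only mentioned in comments so the snippet runs without Prodigy installed.

```python
import sqlite3

class Model:
    """Toy stand-in for the sentence encoder plus classifier."""
    def __init__(self):
        self.positive_words = set()
        self.version = 0

    def score(self, text):
        words = set(text.lower().split())
        return 0.9 if words & self.positive_words else 0.1

    def retrain(self, answers):
        for eg in answers:
            if eg["answer"] == "accept":
                self.positive_words |= set(eg["text"].lower().split())
        self.version += 1

def db_stream(conn):
    """Yield one task dict per row, lazily (hypothetical table `sentences`)."""
    for (text,) in conn.execute("SELECT text FROM sentences ORDER BY id"):
        yield {"text": text}

def score_stream(stream, model):
    """Yield (score, task) tuples, scoring each task when it is requested."""
    for task in stream:
        yield model.score(task["text"]), task

def make_update(model, seen):
    def update(answers):
        # Simplification: accumulate answers in memory and retrain on all of them.
        seen.extend(answers)
        model.retrain(seen)
    return update

def build_components(conn, dataset="blame_demo"):
    model = Model()
    seen = []
    scored = score_stream(db_stream(conn), model)
    # In the real recipe, `scored` is passed to prefer_high_scores(...) and this
    # function is registered as a Prodigy recipe. Here the scores are dropped.
    stream = (task for _score, task in scored)
    components = {
        "dataset": dataset,
        "view_id": "classification",
        "stream": stream,
        "update": make_update(model, seen),
        "config": {"batch_size": 5},
    }
    return components, model
```

The point of the sketch is that the stream and the update callback share the same model object, so whatever the callback retrains is what the stream uses for the next example it scores.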
Here is what I see happening:
- When Prodigy starts, it pulls 65 examples from my stream. Why 65?
- After I tag the first 10 examples, it seems like my custom update function is called. Why 10?
- Beyond the first 10 examples, I see that my customUpdate function is called after every 5 examples I tag, which seems to be in line with the batch_size I have set to 5.
- After every 5 examples I tag, it looks like 8-11 new examples are fetched from my stream. What's the trigger for it to fetch new examples?
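For reference, this is roughly how the pull counts can be measured: wrap the source in a counting iterator. The sketch also shows one generic reason a source can be read more often than the number of items that reach the consumer, namely a filter sitting in between. `keep_even` is a made-up filter for illustration; I am not claiming this is what Prodigy's sorter does or that it explains the numbers above.

```python
import itertools

class CountingStream:
    """Wrap any iterable and count how many items have been pulled from it."""
    def __init__(self, source):
        self._source = iter(source)
        self.pulled = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._source)
        self.pulled += 1
        return item

def keep_even(stream):
    """Hypothetical filter that silently drops some items."""
    for n in stream:
        if n % 2 == 0:
            yield n

source = CountingStream(range(100))
# The consumer asks for a batch of 5, but the filter has to read more than 5.
batch = list(itertools.islice(keep_even(source), 5))
print(batch, source.pulled)
```

Logging `source.pulled` next to each call of the update function gives the kind of trace I describe above.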
So, generally, I wanted to know what the flow is like.
What is the logic inside prefer_high_scores, and when does Prodigy get additional examples from the source?
If I update my model every 5 annotations, I am assuming that only the new examples will be scored using my new model, and not all the old ones that were rejected by prefer_high_scores, say, 30 annotations earlier. Is that right?
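My assumption is based on how plain Python generators behave, which the sketch below demonstrates: an item is scored at the moment it is pulled, and items that were already consumed are not revisited after the model changes. The function name and the dict-based "model" are made up for the example. Whether Prodigy's sorter follows the same pattern is exactly what I am asking; in particular, if anything downstream pre-fetches or buffers items, those buffered items would keep the scores from the older model.

```python
def tag_with_model_version(texts, model):
    """Yield (model_version, text); the version is read when the item is pulled."""
    for text in texts:
        yield model["version"], text

model = {"version": 1}
stream = tag_with_model_version(["a", "b", "c", "d"], model)

# Two items are consumed (and possibly rejected) under the first model.
first_two = [next(stream), next(stream)]

# Simulate the update callback retraining the model.
model["version"] = 2

# Only items pulled after the update see the new model.
rest = list(stream)
print(first_two, rest)
```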