I’m running a customized textcat.teach recipe with multiple labels. Since my stream is a combination of several doc sources, I would like to double-check that I’m getting all the docs I want. I would like to be able to see all the documents, not just the ones with scores close to 0.5. Is there any way I can ask Prodigy to display all the docs, even the ones with really high scores?
Hi! I think the easiest way would be to remove the prefer_uncertain sorter from the recipe. The source of the built-in recipes is included with Prodigy, and you can find the location of your Prodigy installation like this:
python -c "import prodigy; print(prodigy.__file__)"
Then find the following line in the teach recipe function:
stream = prefer_uncertain(predict(stream))
And change it to:
stream = predict(stream)
This will obviously make the recipe a lot less useful, so you should really only do this for debugging purposes.
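To see why removing the sorter surfaces everything, here’s a self-contained sketch with toy stand-ins for predict and prefer_uncertain (the real sorter in Prodigy uses a smarter strategy than this fixed 0.1 window – the stubs are just for illustration):

```python
import random

def predict(stream):
    # toy stand-in: attach a random score to each example
    for eg in stream:
        yield (random.random(), eg)

def prefer_uncertain_stub(scored_stream):
    # crude stand-in for the sorter: only pass through examples
    # whose score is close to 0.5
    for score, eg in scored_stream:
        if abs(score - 0.5) < 0.1:
            yield eg

examples = [{"text": f"doc {i}"} for i in range(100)]

# with the sorter: only the "uncertain" examples come through
filtered = list(prefer_uncertain_stub(predict(iter(examples))))

# without the sorter: every example comes through, scores ignored
unfiltered = [eg for score, eg in predict(iter(examples))]
assert len(unfiltered) == 100
```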
If you just want to check things and test that your stream is read in correctly, you could also use the mark recipe, which will just stream in all examples in order:
prodigy mark some_dataset your_data.jsonl
Thank you for your help so far! This worked perfectly!
I have another question: I’m trying to understand how batch_size works.
I’m using the code below to generate the stream:

def stream_generator():
    s1 = first_query()  # there are only 10 docs
    yield from s1
    s2 = second_query()  # there are 1000's of docs
    yield from s2

stream = stream_generator()
I’m also using result['config']['batch_size'] in my code.
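For context, a custom recipe returns a dictionary of components, and 'config' is just one key in it – so overriding the batch size looks roughly like this. This is a plain-Python sketch of that shape with a dummy recipe name and stream, not real Prodigy code:

```python
def custom_textcat_recipe(dataset, source):
    # dummy stream in place of the real combined query stream
    stream = ({"text": f"doc {i}"} for i in range(5))
    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # iterable of annotation tasks
        "view_id": "classification",
        "config": {"batch_size": 10},
    }

result = custom_textcat_recipe("some_dataset", "your_data.jsonl")
result["config"]["batch_size"] = 5  # override, as in the line above
```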
I adjusted the recipe so it ignores the scores and displays all the docs (I just wanted to make sure my stream is working correctly, but I will use the built-in active learning). Now, when I set batch_size to 30, I can see all the docs from s1 and then it moves on to s2. But when the batch_size is less than 10, it only shows me a few docs from s1 (fewer than the batch_size is set to) and then I get a message that there are no more documents. Once I refresh the page it starts over with a few other docs from s1 and shows "no more documents" afterwards. It never progresses to the docs from s2.
The number of docs in s1 will vary (it won’t always be 10) but will always be smaller than the number of docs in s2. I want to make sure that the docs from s1 are labeled before the docs from s2, but also that the docs from s2 end up in the stream, so I’m trying to understand how the documents are generated.
Since I’m ignoring the scores, I would have expected to see all the docs. But I only see all the s1 docs if the batch size is larger than the number of docs in s1, in which case there is a smooth transition to s2. Otherwise I see fewer s1 docs than the batch size is set to, and it never progresses to s2. I’m not sure why the docs from s2 are never generated?
Could you please advise me on what the best solution here would be?
Your example looks very reasonable, so I’m confused why you’re seeing this type of behaviour. I’ll have a look and see if I can reproduce it. I suspect that maybe something causes your generator to be consumed before it’s done with all the tasks. What do your first_query and second_query functions do, roughly?
Some background on batching in Prodigy: the batch_size essentially defines the number of examples the stream is split into, and also the number of examples the web app will receive at once. Every time the web app requests new tasks, the Prodigy server responds by asking the partitioned stream for the next batch and sending it back to the app.
Here’s an abstract example that illustrates how this works under the hood:

from toolz.itertoolz import partition

batch_size = 10

def stream_generator():
    yield from ('a' for i in range(10))
    yield from ('b' for x in range(50))

stream = stream_generator()
batched_stream = partition(batch_size, stream)
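One detail worth knowing about toolz.itertoolz.partition: by default it only yields complete batches of batch_size, so an incomplete final batch at the end of a stream is dropped. Here’s a pure-Python sketch of that behaviour (using itertools instead of toolz, so it runs standalone):

```python
from itertools import islice

def partition_sketch(n, seq):
    # mimics toolz.itertoolz.partition's default behaviour: yield full
    # tuples of length n, silently dropping an incomplete final batch
    it = iter(seq)
    while True:
        batch = tuple(islice(it, n))
        if len(batch) < n:
            return  # leftover examples (fewer than n) are discarded
        yield batch

# 60 items split cleanly into 6 batches of 10
batches = list(partition_sketch(10, 'a' * 10 + 'b' * 50))
assert len(batches) == 6

# 10 items with a batch size of 7: one full batch, 3 items dropped
short = list(partition_sketch(7, range(10)))
assert short == [(0, 1, 2, 3, 4, 5, 6)]
```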
Thank you, that’s very useful.
first_query and second_query are both ElasticSearch queries that return results via .scan(), so I’m not sure if that might have any impact on it.
Ah, okay – I’m not that familiar with ElasticSearch, so I’m not sure, but it’s definitely possible that there’s a connection here.
Would it be an option to execute the queries on startup, or would this take too long? So, instead of doing it all in the generator, you fetch the query results first and then use that as the stream. A few thousand documents should be fine and shouldn’t add too much overhead. It’ll also make debugging easier, because you can just inspect the stream and you always know exactly what Prodigy receives (and won’t have to make guesses about what happens inside your stream generator).
Here’s an abstract example – for simplicity, I’m just assuming that the queries are both generator functions:
s1 = list(first_query())   # small source: these come first
s2 = list(second_query())  # large source
stream = s1 + s2           # s1 docs are guaranteed to precede s2 docs
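With dummy generator functions standing in for the two queries (the doc counts and the dict shape are made up for illustration), the whole thing would look like this:

```python
def first_query():
    # stand-in for the small ElasticSearch query (~10 docs)
    yield from ({"text": f"s1 doc {i}"} for i in range(10))

def second_query():
    # stand-in for the large query (1000s of docs)
    yield from ({"text": f"s2 doc {i}"} for i in range(1000))

# fetch everything up front instead of streaming lazily
s1 = list(first_query())
s2 = list(second_query())
stream = s1 + s2

# the s1 docs always come before the s2 docs, and both are present
assert stream[0]["text"] == "s1 doc 0"
assert stream[10]["text"] == "s2 doc 0"
```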