We created a multi-user system on top of Prodigy, and we use the prodigy command to launch new annotation sessions via a Python subprocess. But when we have more than 4 users, it becomes too slow and crashes frequently.
What would the hardware cost be to have 100 users annotating in parallel?
Thanks.
Did you profile it, and if so, did you find the bottleneck? Do you run out of memory? Prodigy itself should be very lightweight, so memory consumption mostly comes down to the models you're loading. If you have enough memory and you want to run separate instances, it'd mostly come down to how many FastAPI apps you can run on the machine at the same time. If you have a lot of users, you'd probably also have at least some working on the same data, so you could use a single instance with multi-user sessions.
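For the single-instance route, a shared instance with named multi-user sessions could roughly look like this – a minimal sketch, assuming the command-string form of prodigy.serve and the feed_overlap setting; the dataset, source file and labels are placeholders:

```python
import prodigy

# One shared Prodigy instance: each user opens the app with their own
# session name appended, e.g. http://localhost:8080/?session=alice,
# and their answers are tracked per session in the same dataset.
prodigy.serve(
    "ner.manual shared_dataset blank:en ./examples.jsonl --label PERSON,ORG",
    port=8080,
    feed_overlap=False,  # send each example to one session instead of all
)
```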
OK, thanks for the clarification. To avoid data overlapping, we create batches from the original corpus of every project and then assign different batches to different users, so we don't have that problem. All of this is saved in a MongoDB. So Prodigy is used just to do the annotation and isn't aware of the multi-user system we have.
For every batch we create a dataset; I don't know if that can cause problems, because we will end up with many datasets.
This sounds like a reasonable approach! By batch, do you mean an annotation session/instance you start? In general, having more, smaller datasets should perform better than having one giant dataset or instance, so I don't think this would be a problem.
But maybe try and profile your setup and see if you can find where the problems occur. If you're calling into Prodigy's DB manually to extract the annotated data, maybe your script fires too often? Or maybe you're starting too many processes, or not terminating processes, etc.?
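If it helps, here's a quick way to spot stray instances from Python – a sketch using the third-party psutil package, nothing Prodigy-specific:

```python
import psutil

# List processes whose command line mentions prodigy, with their memory
# use – the same information you'd eyeball in `top`, but filterable.
for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    mem = proc.info["memory_info"]
    if "prodigy" in cmdline and mem is not None:
        print(f"{proc.info['pid']}: {mem.rss / 1024**2:.0f} MB  {cmdline}")
```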
By batch I mean 50 to 100 examples to annotate.
So to launch Prodigy for every user, you suggest that it's better to just call the prodigy shell command with something like subprocess in Python? Or would it be better to use the Prodigy API, if that's even possible in this case.
Thanks in advance.
Not sure if it makes a difference – personally, I always feel like calling something in a subprocess from within Python adds another layer of abstraction and can make things harder to debug, so if it's no problem to use shell commands and shell scripts, I'd stick to doing that.
After an annotation process is completed, do you shut it down again? Maybe something goes wrong there and you end up with all these additional processes running? Maybe also look at `top` and check your memory usage – this should give you an idea of what part might be causing the problem.
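If you're starting the instances from Python, the main thing is to keep a handle on every process you start so it can be shut down again when the batch is done. A rough sketch – the recipe arguments are placeholders, and the PRODIGY_PORT override is an assumption on my part, so adjust it to however you set the port:

```python
import os
import subprocess

# Registry of running Prodigy instances, keyed by user ID, so every
# process that gets started is also terminated when annotation is done.
procs: dict[str, subprocess.Popen] = {}

def start_instance(user_id: str, dataset: str, source: str, port: int) -> None:
    procs[user_id] = subprocess.Popen(
        ["prodigy", "ner.manual", dataset, "blank:en", source, "--label", "PERSON"],
        env={**os.environ, "PRODIGY_PORT": str(port)},  # one port per user
    )

def stop_instance(user_id: str) -> None:
    proc = procs.pop(user_id, None)
    if proc is not None:
        proc.terminate()       # ask the server to exit
        proc.wait(timeout=10)  # reap it so it doesn't linger as a zombie
```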
OK, I see, but how would you launch shell commands without using subprocess inside the Python code, since the command depends on some variables and you also need something to launch the Prodigy UI?
When I was talking about an API, I thought that Prodigy had an API to use inside the code to launch the annotation UI without using the shell command.
One last question: is there any way for me to launch annotation by giving the text content from a variable instead of a JSONL file?
Ah, maybe I misunderstood your question? I thought you were currently using only shell scripts to launch Prodigy instances and were considering doing it from within Python.
If you're running prodigy.serve, you can just substitute your variable in the string you pass to it. If you're using shell scripts, you can use environment variables. (Neither of these is specific to Prodigy.)
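For example, a minimal sketch of the prodigy.serve route – the recipe, dataset name and file path are placeholders:

```python
import prodigy

# Values produced by your own multi-user app (placeholders)
dataset = "project_a_batch_003"
source = "/data/batches/batch_003.jsonl"
labels = "PERSON,ORG"

# Substitute the variables into the command string, then start the server
prodigy.serve(f"ner.manual {dataset} blank:en {source} --label {labels}", port=8081)
```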
Ah, sorry for not being clear.
We have developed a Python app to handle multi-user annotation: when a user clicks the button to annotate a batch of examples, we launch the Prodigy app from the Python code with the corresponding data. My question was which approach is better in terms of performance and organization (both options sketched below):

1. Use subprocess to call the prodigy shell command for every user.
2. Use the Prodigy API (prodigy.serve) to launch the Prodigy app for each user.
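To make the two options concrete, here's a rough sketch – the recipe and its arguments are placeholders, and note that prodigy.serve blocks the calling process, so from an app you'd run it in its own process anyway:

```python
import multiprocessing
import subprocess

import prodigy

# Placeholder recipe invocation – adjust dataset, model, source and labels
CMD = "ner.manual my_dataset blank:en /data/batch.jsonl --label PERSON"

if __name__ == "__main__":
    # Option 1: shell out to the prodigy CLI, one OS process per user
    proc = subprocess.Popen(["prodigy", *CMD.split()])

    # Option 2: call the Python API. prodigy.serve() blocks until the
    # server stops, so run it in its own process to keep the app responsive.
    worker = multiprocessing.Process(
        target=prodigy.serve, args=(CMD,), kwargs={"port": 8082}
    )
    worker.start()
```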