We created a multi-user system on top of Prodigy, and we use the prodigy command to launch new annotation sessions via a Python subprocess. But when we have more than 4 users, it becomes too slow and crashes frequently.
What would the hardware cost be to have 100 users annotating in parallel?
Thanks.
Did you profile it, and if so, did you find the bottleneck? Do you run out of memory? Prodigy itself should be very lightweight, so memory consumption mostly comes down to the models you're loading. If you have enough memory and you want to run separate instances, it'd mostly come down to how many FastAPI apps you can run on the machine at the same time. If you have a lot of users, you'd probably also have at least some working on the same data, so you could use a single instance with multi-user sessions.
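For the single-instance route, a shared instance with named multi-user sessions could roughly look like this – a minimal sketch, assuming the command-string form of prodigy.serve and the feed_overlap setting; the dataset, source file and labels are placeholders:

```python
import prodigy

# One shared Prodigy instance: each user opens the app with their own
# session name appended, e.g. http://localhost:8080/?session=alice,
# and their answers are tracked per session in the same dataset.
prodigy.serve(
    "ner.manual shared_dataset blank:en ./examples.jsonl --label PERSON,ORG",
    port=8080,
    feed_overlap=False,  # send each example to one session instead of all
)
```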
OK, thanks for the clarification. To avoid data overlapping, we create batches from the original corpus of every project and then assign different batches to different users, so we don't have that problem. All of this is saved in a MongoDB. So Prodigy is used just to do the annotation and isn't aware of the multi-user system we have.
For every batch we create a dataset; I don't know if that can cause problems, because we will end up with many datasets.
This sounds like a reasonable approach! By batch, do you mean an annotation session/instance you start? In general, having more, smaller datasets should perform better than having one giant dataset or instance, so I don't think this would be a problem.
But maybe try and profile your setup and see if you can find where the problems occur. If you're calling into Prodigy's DB manually to extract the annotated data, maybe your script fires too often? Or maybe you're starting too many processes, or not terminating processes, etc.?
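If it helps, here's a quick way to spot stray instances from Python – a sketch using the third-party psutil package, nothing Prodigy-specific:

```python
import psutil

# List processes whose command line mentions prodigy, with their memory
# use – the same information you'd eyeball in `top`, but filterable.
for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    mem = proc.info["memory_info"]
    if "prodigy" in cmdline and mem is not None:
        print(f"{proc.info['pid']}: {mem.rss / 1024**2:.0f} MB  {cmdline}")
```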
By batch I mean 50 to 100 examples to annotate.
So to launch Prodigy for every user, you suggest that it's better to just call the prodigy shell command with something like subprocess in Python? Or would it be better to use the Prodigy API, if that's even possible in this case.
Thanks in advance.
Not sure if it makes a difference – personally, I always feel like calling something in a subprocess from within Python adds another layer of abstraction and can make things harder to debug, so if it's no problem to use shell commands and shell scripts, I'd stick to doing that.
After an annotation process is completed, do you shut it down again? Maybe something goes wrong there and you end up with all these additional processes running? Maybe also look at `top` and check your memory usage – this should give you an idea of what part might be causing the problem.
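If you're starting the instances from Python, the main thing is to keep a handle on every process you start so it can be shut down again when the batch is done. A rough sketch – the recipe arguments are placeholders, and the PRODIGY_PORT override is an assumption on my part, so adjust it to however you set the port:

```python
import os
import subprocess

# Registry of running Prodigy instances, keyed by user ID, so every
# process that gets started is also terminated when annotation is done.
procs: dict[str, subprocess.Popen] = {}

def start_instance(user_id: str, dataset: str, source: str, port: int) -> None:
    procs[user_id] = subprocess.Popen(
        ["prodigy", "ner.manual", dataset, "blank:en", source, "--label", "PERSON"],
        env={**os.environ, "PRODIGY_PORT": str(port)},  # one port per user
    )

def stop_instance(user_id: str) -> None:
    proc = procs.pop(user_id, None)
    if proc is not None:
        proc.terminate()       # ask the server to exit
        proc.wait(timeout=10)  # reap it so it doesn't linger as a zombie
```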
OK, I see, but how would you launch shell commands without using subprocess inside the Python code, since the command depends on some variables and you also need something to launch the Prodigy UI?
When I was talking about an API, I thought that Prodigy had an API to use inside the code to launch the annotation UI without using the shell command.
One last question: is there any way for me to launch annotation by giving the text content from a variable instead of a JSONL file?
Ah, maybe I misunderstood your question? I thought you were currently using only shell scripts to launch Prodigy instances and were considering doing it from within Python.
If you're running prodigy.serve, you can just substitute your variable in the string you pass to it. If you're using shell scripts, you can use environment variables. (Neither of these is specific to Prodigy.)
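For example, a minimal sketch of the prodigy.serve route – the recipe, dataset name and file path are placeholders:

```python
import prodigy

# Values produced by your own multi-user app (placeholders)
dataset = "project_a_batch_003"
source = "/data/batches/batch_003.jsonl"
labels = "PERSON,ORG"

# Substitute the variables into the command string, then start the server
prodigy.serve(f"ner.manual {dataset} blank:en {source} --label {labels}", port=8081)
```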
Ah, sorry for not being clear.
We have developed a Python app to handle multi-user annotation: when a user clicks the button to annotate a batch of examples, we launch the Prodigy app from the Python code with the corresponding data. My question was which approach is better in terms of performance and organization (both options sketched below):

1. Use subprocess to call the prodigy shell command for every user.
2. Use the Prodigy API (prodigy.serve) to launch the Prodigy app for each user.
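To make the two options concrete, here's a rough sketch – the recipe and its arguments are placeholders, and note that prodigy.serve blocks the calling process, so from an app you'd run it in its own process anyway:

```python
import multiprocessing
import subprocess

import prodigy

# Placeholder recipe invocation – adjust dataset, model, source and labels
CMD = "ner.manual my_dataset blank:en /data/batch.jsonl --label PERSON"

if __name__ == "__main__":
    # Option 1: shell out to the prodigy CLI, one OS process per user
    proc = subprocess.Popen(["prodigy", *CMD.split()])

    # Option 2: call the Python API. prodigy.serve() blocks until the
    # server stops, so run it in its own process to keep the app responsive.
    worker = multiprocessing.Process(
        target=prodigy.serve, args=(CMD,), kwargs={"port": 8082}
    )
    worker.start()
```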