Error while trying to save large annotated documents.

Hello Prodigy team,

We are investigating an interesting save error that occurs on our Prodigy instances when working with very large text documents.
Error message:
[screenshot of the error popup in the Prodigy UI]
Here is what our specs are:
Docker Image running in Kubernetes pod
Built with Python 3.10.11
Using Prodigy 1.12.7
DB: MySQL 8
Recipe: ner.manual with --highlight-chars
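
For reference, we launch the recipe roughly like this (dataset name, labels and source file below are placeholders, and the blank pipeline is only used for tokenization):

PRODIGY_LOGGING=verbose prodigy ner.manual large_docs blank:en ./large_docs.jsonl --label PERSON,ORG --highlight-chars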

Here is what we did and observed:
We noticed that when a user gets to very large documents (above 18,000 characters), the error above pops up.

To replicate the error we created a dataset with documents of 18,000+ characters.
We then checked the logs with PRODIGY_LOGGING=basic and PRODIGY_LOGGING=verbose and found the following when the first large document loads:

...32800, 'end': 32801, 'id': 27860, 'ws': False}, {'text': 'g', 'start': 32801, 'end': 32802, 'id': 27861, 'ws': False}], '_view_id': 'ner_manual'}], 'total': 0, 'progress': 0.357, 'session_id': '2023-11-15_13-30-46'}
INFO:     100.99.30.69:37200 - "POST /get_session_questions HTTP/1.0" 200 OK

The user then selects an answer in the Prodigy UI and tries to save the document, and at that point the error appears.

The logs show nothing about the cause; they simply stop at the 200 OK above, and the UI cannot save any documents past the one that triggered the issue.

If another user session is created, the work continues until another large document is encountered, and so on.

Interestingly, we ran a test on a local machine using a plain Python environment with the same Prodigy version (no Docker, no Kubernetes) and could not reproduce the issue: the documents were saved into the MySQL DB without any errors.

Do you have any idea from your end what could be causing this issue?

The type of the content column in the Example table of our MySQL schema is blob. This is a legacy decision with the database, but it does mean the column only supports a limited number of bytes (a MySQL BLOB tops out at 65,535 bytes).

There's nothing stopping you from changing the type of the column once Prodigy has created it, though.

Running this SQL statement against your database should suffice (longblob is also available for really long data):
ALTER TABLE example MODIFY content mediumblob;
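
If you want to double-check the column type before and after the change, something along these lines should work:

SHOW COLUMNS FROM example LIKE 'content';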

If a row was already created for the too-long document, you will also need to remove it from the database.
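
To find candidates, a query along these lines should list the largest stored examples (just a sketch, adjust as needed):

SELECT id, LENGTH(content) AS bytes FROM example ORDER BY bytes DESC LIMIT 10;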

This happens sometimes, and we do actually have an error that is supposed to be raised. If you see that error pop up in the UI, it means the Prodigy server hit some error, so I'm surprised your logs didn't show anything here.

Are you able to reproduce this example and share the full logs with PRODIGY_LOGGING=verbose?

Hello @kab

We are aware that large documents can get cut off, and we have already set the content column of the Example table to mediumblob. As I said, when we try to reproduce this error locally the large documents are saved fine, so it doesn't look like a database issue.

On the logging side, I confirm that we see nothing when using PRODIGY_LOGGING=verbose. The file is loaded, each character gets tokenized, then the user selects an answer and saves the session progress; the UI error appears, with nothing new showing up in the logs.

'start': 32781, 'end': 32782, 'id': 27844, 'ws': False}, {'text': 'm', 'start': 32782, 'end': 32783, 'id': 27845, 'ws': False}, {'text': 'a', 'start': 32783, 'end': 32784, 'id': 27846, 'ws': False}, {'text': 'n', 'start': 32784, 'end': 32785, 'id': 27847, 'ws': False}, {'text': 'd', 'start': 32785, 'end': 32786, 'id': 27848, 'ws': True}, {'text': 'a', 'start': 32787, 'end': 32788, 'id': 27849, 'ws': False}, {'text': 'n', 'start': 32788, 'end': 32789, 'id': 27850, 'ws': False}, {'text': 'd', 'start': 32789, 'end': 32790, 'id': 27851, 'ws': True}, {'text': 'e', 'start': 32791, 'end': 32792, 'id': 27852, 'ws': False}, {'text': 'a', 'start': 32792, 'end': 32793, 'id': 27853, 'ws': False}, {'text': 's', 'start': 32793, 'end': 32794, 'id': 27854, 'ws': False}, {'text': 'i', 'start': 32794, 'end': 32795, 'id': 27855, 'ws': False}, {'text': 'n', 'start': 32795, 'end': 32796, 'id': 27856, 'ws': False}, {'text': 'g', 'start': 32796, 'end': 32797, 'id': 27857, 'ws': True}, {'text': 'c', 'start': 32798, 'end': 32799, 'id': 27858, 'ws': False}, {'text': 'o', 'start': 32799, 'end': 32800, 'id': 27859, 'ws': False}, {'text': 'n', 'start': 32800, 'end': 32801, 'id': 27860, 'ws': False}, {'text': 'g', 'start': 32801, 'end': 32802, 'id': 27861, 'ws': False}], '_view_id': 'ner_manual'}], 'total': 0, 'progress': 0.357, 'session_id': '2023-11-15_13-30-46'}
INFO:     100.99.30.69:37200 - "POST /get_session_questions HTTP/1.0" 200 OK

Locally, when the user saves, we get past the /get_session_questions line and everything works as intended.

The only difference we see when comparing local vs. the Docker image in the Kubernetes pod is the following:

on docker:

INFO:     100.99.30.69:37200 - "POST /get_session_questions HTTP/1.0" 200 OK

on local:

INFO:     127.0.0.1:36752 - "POST /get_session_questions HTTP/1.1" 200 OK

HTTP/1.0 vs HTTP/1.1

We are currently investigating whether this can cause any problems, and we are also trying to capture some logs of our own outside of Prodigy to understand what may be the issue here.
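
As a sketch of the kind of outside-of-Prodigy check we mean (hostname and port are placeholders for our setup), the protocol version and any proxy Server header can be inspected with:

curl -sv -o /dev/null http://prodigy.example.internal:8080/ 2>&1 | grep -iE "HTTP/|server:"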

Thanks for the update.

Here are a few ideas of some things you can check:

  1. Check for any differences in the environment configurations: Ensure that the environment variables and configurations are the same in both your local and Docker setups.
  2. HTTP Version: You've noticed a difference in the HTTP versions (HTTP/1.0 in Docker vs HTTP/1.1 locally). This could potentially lead to differences in behavior. HTTP/1.1 supports chunked transfer encoding, which allows the sender to start transmitting dynamically-generated content before knowing the total size of the content. If your application relies on this feature, it could explain the issues you're seeing.
  3. Network issues: Check if there are any network policies or configurations that might be affecting the communication between the Prodigy server and the client.
  4. Memory constraints: Kubernetes pods have memory limits. If your application is trying to process a large document, it might be hitting these limits, causing the operation to fail. Check the memory usage and limits of your pod (see the sketch after this list).
  5. Check Docker logs: If Prodigy is running inside a Docker container, there might be useful information in the Docker logs. You can view these logs using the docker logs <container_id> command.
  6. Prodigy version: Ensure that the Prodigy version is the same in both environments. If not, consider upgrading to the latest version.
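
For point 4, a quick way to inspect the pod's configured limits and current usage could look something like this (pod name and namespace are placeholders):

# Show the memory/CPU limits configured on the container
kubectl -n prodigy describe pod prodigy-0 | grep -A 4 "Limits"

# Show live CPU/memory usage (requires metrics-server in the cluster)
kubectl -n prodigy top pod prodigy-0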

Hope this helps!

Hi @ryanwesslen,

Thank you for all the pointers you provided!

We were able to pinpoint the root cause of our problem.
It was a config setting on the frontend (the proxy sitting in front of Prodigy):
client_max_body_size
We increased the limit and that fixed the issue. :slight_smile:
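
For anyone hitting the same wall, the change on our side looked roughly like this (client_max_body_size is an NGINX directive and the value below is just an example; adjust it to whatever proxy fronts your Prodigy instance):

# In the server/location block that proxies to Prodigy.
# The NGINX default is 1m; a document with per-character tokens easily exceeds
# that, and the proxy then rejects the POST before it ever reaches Prodigy.
client_max_body_size 20m;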
