Command "db-in" returns "MySQL server has gone away"

I'm trying:
prodigy db-in my_dataset pre_annotated.jsonl

It returns:

peewee.OperationalError: (2006, 'MySQL server has gone away')

This only happens with "db-in" but other commands run properly.
Did anybody else experience this, or is there anything I'm doing wrong? Thanks!

Hi! How many examples do you have in pre_annotated.jsonl? If the file is very large, it may be that the connection times out while the file is loaded into memory.

If the file is very large, the easiest fix would probably be to just split it up into two or more files and run the command for each of them. Alternatively, you might also want to double-check your connect_timeout setting (see here). Finally, you could also edit the recipe script and move the DB = connect() call further down so it's only called after the examples are loaded and right before they're added. (The main reason the recipe connects to the database first is that you'll immediately get an error if the connection doesn't work.)
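If you go the splitting route, a quick throwaway script along these lines should work. (This assumes plain newline-delimited JSON; split_jsonl is just a made-up helper for illustration, not a Prodigy function.)

```python
from pathlib import Path

def split_jsonl(in_path, out_dir, chunk_size=25):
    """Split a JSONL file into numbered chunk files of `chunk_size` lines each."""
    lines = Path(in_path).read_text(encoding="utf8").splitlines()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_files = []
    for i in range(0, len(lines), chunk_size):
        # Write each slice of lines to its own chunk_N.jsonl file
        out_file = out_dir / "chunk_{}.jsonl".format(i // chunk_size)
        out_file.write_text("\n".join(lines[i:i + chunk_size]), encoding="utf8")
        out_files.append(out_file)
    return out_files
```

You'd then run `prodigy db-in my_dataset chunk_0.jsonl` (and so on) for each chunk.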

pre_annotated.jsonl contains 124 examples (126.7 MB).
I have set connect_timeout = 100
Still not working!

Okay, so that's not a lot of examples, but they're all very large (~1 MB per example). So it's likely related to that. What happens if you move the call to DB = connect() further down, so it's only called after the examples are loaded and right before they're added?

I moved DB = connect() as shown below (change indicated with #PREVIOUS and #UPDATED):

def db_in(set_id, in_file, loader=None, answer="accept", overwrite=False, dry=False):
    """
    Import annotations to the database. Supports all formats loadable by
    Prodigy.
    """
    # DB = connect() #PREVIOUS
    if not in_file.exists() or not in_file.is_file():
        prints("Not a valid input file.", in_file, exits=1, error=True)
    DB = connect() #UPDATED 
    if set_id not in DB:
        prints(
            "Can't find '{}' in database {}.".format(set_id, DB.db_name),
            "Maybe you misspelled the name or forgot to add the dataset "
            "using the `dataset` command?",
            exits=1,
            error=True,
        )
    loader = get_loader(loader, file_path=in_file)
    annotations = loader(in_file)
    annotations = [set_hashes(eg) for eg in annotations]
    added_answers = 0

If I move it further down, the if block fails:

if set_id not in DB:
    UnboundLocalError: local variable 'DB' referenced before assignment

If I skip the if block and move it further down, it still doesn't work.

if not dry:
    DB = connect()
    DB.add_dataset(session_id, session=True)
    DB.add_examples(annotations, datasets=[set_id, session_id])

Ah, this is not really what I meant, sorry! In any case, it looks like the main issue here is likely that your examples are so huge that the connection times out while you're adding the examples.

What's in your pre-annotated examples, btw? Are you sure you want to be importing single examples of ~1 MB each? Unless you're working with images, that's pretty unusual, and it'd mean that you end up with a ~1 MB blob of data in your database for every example, which can easily lead to more problems later on.