How to reuse prodigy.db to retrain an older (spaCy v2) custom NER model

Hi,
Last year, in June 2021, I created a custom NER model with Prodigy (and spaCy 2.x.x) on my Windows laptop:
python -m prodigy train ner dataset,dataset_correct,dataset_correct1,dataset_correct3 en_vectors_web_lg --output C:\Users\myname\Documents\tmp_model --eval-split 0.2 --n-iter 40

I tried to upload this model to huggingface.co, but I could not, due to the incompatibility between spaCy v2 (used by the model) and spaCy v3 (required by spacy-huggingface-hub). I have therefore decided to install Prodigy on my Ubuntu 22.04 laptop to retrain the old model or rebuild it, depending on what is possible.

I still have the .prodigy folder from my Windows laptop from last year. It contains two files: prodigy.db (168 MB, 9 datasets) and prodigy.json (6 B). I want to reuse this prodigy.db database to retrain or rebuild the old model.

Can you please give suggestions on how to do this, with links to the relevant code or documentation?

Regards,
Rahul

Hi Rahul,

One way you can do this is to copy over the prodigy.db file from your old laptop into your new environment. You'd usually find that in the Prodigy home directory. By default, the prodigy.db file is essentially a SQLite database.
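As a minimal sketch of that copy step, assuming the default Prodigy home directory (~/.prodigy) on the new machine (the source path is a placeholder):

# Show where Prodigy expects its home directory and database
python -m prodigy stats
# Copy the old database into place
cp /path/to/old/prodigy.db ~/.prodigy/prodigy.db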

You can train again using the prodigy train command (be sure to check the new parameters and arguments), and it will result in a spaCy v3 model (under the hood, prodigy train calls the same code as spacy train in v3). We highly recommend doing it this way so that it's easier to integrate with other services (e.g. Hugging Face).
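For reference, the v1.11+ train syntax takes the output directory as its first argument and the datasets per component, so your old command would look roughly like this (dataset names taken from your original command; en_vectors_web_lg is spaCy v2-only, so if you want pretrained vectors you'd initialize from a v3 pipeline via --base-model instead):

python -m prodigy train ./tmp_model --ner dataset,dataset_correct,dataset_correct1,dataset_correct3 --eval-split 0.2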

Hi,

Thank you for your suggestion. I copied the prodigy.db file from the old laptop into the new environment, in the .prodigy home directory. I set the path in prodigy.json to '/home/gebruiker/.prodigy'.
I used the train recipe on a dataset from the database, but the process is killed while initializing the pipeline.

Can you suggest what is happening here?

The steps I took are here :

(prodigy-env) (base) gebruiker@xxxxxU:~/anaconda3/envs$ python -m prodigy train /home/xxxxx/Documenten/ --ner test_dataset --base-model en_core_web_lg
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
Killed
(prodigy-env) (base) gebruiker@xxxx:~/anaconda3/envs$

Can you try running with PRODIGY_LOGGING=verbose? Something like:

PRODIGY_LOGGING=verbose python -m ...

There are many possible reasons why the pipeline gets killed; memory is the most common, but it can be something else. Here are a few ways I usually debug this:

  • How large is your dataset? Can you try updating the config.cfg file and reducing the batch size (see the sketch after this list)? This could be an out-of-memory kill.
  • It may also be a dependency error. What's the output of your python -m prodigy stats -l and python -m spacy info?
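As a minimal sketch of the batch-size route (Prodigy v1.11+; paths and the dataset name are examples): export your annotations and an auto-generated config with data-to-spacy, lower the batcher size in the config, then train directly with spaCy:

# Export annotations plus a config.cfg into ./corpus
python -m prodigy data-to-spacy ./corpus --ner test_dataset --base-model en_core_web_lg
# Edit ./corpus/config.cfg and reduce the size setting in [training.batcher], then:
python -m spacy train ./corpus/config.cfg --output ./output --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy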

Hi,

Here is the output with verbose logging:
(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ PRODIGY_LOGGING=verbose python -m prodigy train /home/gebruiker/Documenten/ --ner dataset,dataset_anon,dataset_combined --base-model en_core_web_lg
08:32:23: INIT: Setting all logging levels to 10
08:32:23: RECIPE: Calling recipe 'train'
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
08:32:30: CONFIG: Using config from global prodigy.json
/home/gebruiker/.prodigy/prodigy.json

08:32:30: DB: Initializing database SQLite
08:32:30: DB: Connecting to database SQLite
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
Killed

The database is 170 MB in size; I could open it in VS Code and in a Google SQLite viewer and see the datasets. The process uses the config.cfg file from the base model, en_core_web_lg, with batch size 256.

(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ python -m prodigy stats -l
Traceback (most recent call last):
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
    self._state.set_connection(self._connect())
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
    conn = sqlite3.connect(self.database, timeout=self._timeout,
sqlite3.OperationalError: unable to open database file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3195, in execute_sql
    cursor = self.cursor(commit)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3179, in cursor
    self.connect()
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3136, in connect
    self._initialize_connection(self._state.conn)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
    reraise(new_type, new_type(exc_value, *exc_args), traceback)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
    raise value.with_traceback(tb)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
    self._state.set_connection(self._connect())
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
    conn = sqlite3.connect(self.database, timeout=self._timeout,
peewee.OperationalError: unable to open database file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gebruiker/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gebruiker/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 364, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/plac_core.py", line 367, in __call__
    cmd, result = parser.consume(arglist)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/recipes/commands.py", line 46, in stats
    "total_datasets": len(DB.datasets),
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/components/db.py", line 236, in datasets
    return [ds.name for ds in datasets]
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 7014, in __iter__
    self.execute()
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 1927, in inner
    return method(self, database, *args, **kwargs)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 1998, in execute
    return self._execute(database)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2171, in _execute
    cursor = database.execute(self)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3210, in execute
    return self.execute_sql(sql, params, commit=commit)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3204, in execute_sql
    self.commit()
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
    reraise(new_type, new_type(exc_value, *exc_args), traceback)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
    raise value.with_traceback(tb)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3195, in execute_sql
    cursor = self.cursor(commit)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3179, in cursor
    self.connect()
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3136, in connect
    self._initialize_connection(self._state.conn)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
    reraise(new_type, new_type(exc_value, *exc_args), traceback)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
    raise value.with_traceback(tb)
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
    self._state.set_connection(self._connect())
  File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
    conn = sqlite3.connect(self.database, timeout=self._timeout,
peewee.OperationalError: unable to open database file

(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version 3.4.1
Location /home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/spacy
Platform Linux-5.15.0-46-generic-x86_64-with-glibc2.35
Python version 3.9.12
Pipelines en_core_web_lg (3.4.0)

I think it has to do with a dependency error.
Thanks for your help.
Regards, Rahul

Hi @rahul1,

Upon checking the error message, perhaps what we can do instead is:

  • Export the datasets from your old prodigy.db using the db-out command.
  • This in turn produces a JSONL file that you can use for other downstream purposes as well, but in our case we'll use it to hydrate the new prodigy.db (see the sketch after this list).
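For example, to export one dataset and sanity-check the result (the dataset name is a placeholder):

python -m prodigy db-out dataset > ./dataset.jsonl
head -n 1 ./dataset.jsonl   # each line is one annotated example as JSON
wc -l ./dataset.jsonl       # number of exported annotations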

There might have been internal changes in the SQLite file format across versions, so it can be hard to pin down the exact error. If you can still run Prodigy against that old database, the steps above should work.


Hi,
Unfortunately I cannot run Prodigy on the old database; I get the same error:
peewee.OperationalError: unable to open database file

I have let the problem rest for a while. I can still access the datasets from the database in a Google SQLite viewer. Maybe I can check each dataset that way and possibly download them into the .prodigy folder? I will look into it.
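For instance, the sqlite3 command-line tool can sometimes still read a file that Prodigy cannot open (this assumes Prodigy's default schema, which stores datasets in a dataset table):

sqlite3 ~/.prodigy/prodigy.db "SELECT name FROM dataset;"   # list dataset names
sqlite3 ~/.prodigy/prodigy.db "PRAGMA integrity_check;"     # check for file corruption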

Hi,
I finally managed to export the old prodigy.db datasets using the db-out command to produce a JSONL file. The trick is: before executing the db-out command, set the name of the old database ('prodigy2.db' in my case) in the prodigy.json file.

In addition, I can now create a model just by using the old dataset, without having to hydrate the new database:
prodigy train /home/gebruiker/Documenten/ --ner dataset_old

I will look into the documentation about using the JSONL file to hydrate the new database. Maybe the quality of the model will improve.
Thanks for your advice.

regards
Rahul

How to transfer prodigyold.db (containing an older Prodigy-annotated dataset) to a new prodigy.db

Step: Move the old database file into the .prodigy folder.
Check that the database name does not contain any spaces or symbols (otherwise you will get errors in the following steps). Your .prodigy folder now contains two database files (prodigy.db and prodigyold.db) and a JSON file (prodigy.json).
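For example (the source path is a placeholder for wherever your backup lives):

mv /path/to/backup/prodigyold.db ~/.prodigy/
ls ~/.prodigy
# prodigy.db  prodigyold.db  prodigy.json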

Step: Change the name of the db file in prodigy.json:

{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigyold.db",
      "path": "./.prodigy"
    }
  }
}

This step can be done in an SSH terminal using vi (more info online). Alternatively, the file can be edited in the local environment, uploaded over SSH, and moved into the .prodigy folder.
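As a rough alternative to hand-editing (this assumes the exact "name" string below appears in your prodigy.json; back the file up first):

cp ~/.prodigy/prodigy.json ~/.prodigy/prodigy.json.bak
sed -i 's/"name": "prodigy.db"/"name": "prodigyold.db"/' ~/.prodigy/prodigy.json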

Step: Check which datasets are in your old database by typing the following command:

prodigy stats -l

The result shows the number of datasets and sessions, as well as the dataset names. There is only one dataset in my case (old_dataset). A model can be created just by using the old dataset, without having to hydrate the new database:

prodigy train /home/gebruiker/Documenten/ --ner old_dataset

However, it is good practice to export the annotated data from those datasets as JSONL files, as shown below.

Step: Export the dataset from the old prodigyold.db using the db-out command to produce a JSONL file.
Beware: before executing the db-out command, set the name of the old database ('prodigyold.db' in my case) in the prodigy.json file, as shown above.

prodigy db-out old_dataset > ./old_data.jsonl

Step: Change the name of the db file back in prodigy.json:

{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "./.prodigy"
    }
  }
}

Step: Create new_dataset in prodigy.db using the annotated old_data.jsonl:

prodigy db-in new_dataset ./old_data.jsonl --rehash

You can now use new_dataset to create a model with the prodigy train recipe.
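For example (the output path is a placeholder; --eval-split holds out part of the data for evaluation):

prodigy train ./output_model --ner new_dataset --eval-split 0.2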
