How to reuse prodigy.db to retrain an old (spaCy v2) custom NER model

Hi,
Last year, in June 2021, I created a custom NER model with Prodigy (and spaCy 2.x) on my Windows laptop:
python -m prodigy train ner dataset,dataset_correct,dataset_correct1,dataset_correct3 en_vectors_web_lg --output C:\Users\myname\Documents\tmp_model --eval-split 0.2 --n-iter 40

I tried to upload this model to huggingface.co, but I could not, due to the incompatibility between spaCy v2 (used by the model) and spaCy v3 (required by spacy-huggingface-hub). I have therefore decided to install Prodigy on my Ubuntu 22.04 laptop to retrain the old model, or rebuild it, depending on what is possible.

I still have the .prodigy folder from my Windows laptop from last year. It contains two files: prodigy.db (168 MB, 9 datasets) and prodigy.json (6 B). I want to reuse this prodigy.db database to retrain or rebuild the old model.

Can you please give suggestions on how to do this, with links to the relevant code or documentation?

gr.
Rahul

Hi Rahul,

One way you can do this is to copy the prodigy.db file from your old laptop into your new environment. You'd usually find it in the Prodigy home directory. By default, prodigy.db is a plain SQLite database.
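After copying, a quick sanity check that the file is a readable SQLite database and still contains your datasets can save some debugging later. A small sketch; the table name "dataset" matches Prodigy's default SQLite schema, but that's an assumption worth confirming for your version:

```python
import sqlite3

def list_prodigy_datasets(db_path: str) -> list:
    """Return the dataset names stored in a Prodigy SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT name FROM dataset").fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

If this raises "unable to open database file" or returns nothing, the copy itself (path, permissions, truncated transfer) is the first thing to check.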

You can then train again using the prodigy train command (be sure to check the new parameters and arguments), and it will result in a spaCy v3 model (under the hood, prodigy train calls the same code as spacy train in v3). We highly recommend doing it this way so that it's easier to integrate with other services (e.g. Hugging Face), etc.
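For reference, the train syntax changed in the spaCy v3 releases of Prodigy (v1.11+): the output directory is now a positional argument and datasets are passed per component. A sketch using the dataset names from the original v2 command; check python -m prodigy train --help for the exact options in your installed version:

```shell
# Prodigy v1.11+ train syntax (spaCy v3 under the hood); the dataset names
# here are taken from the original v2 command and may differ in your database.
python -m prodigy train ./tmp_model \
    --ner dataset,dataset_correct,dataset_correct1,dataset_correct3 \
    --eval-split 0.2
```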

Hi,

Thank you for your suggestion. I copied the prodigy.db file from the old laptop into the .prodigy home directory of the new environment, and set the path in prodigy.json to '/home/gebruiker/.prodigy'.
I then used the train recipe on a dataset from the database, but the process gets killed right after initializing the pipeline.

Can you suggest what is happening here?

The steps I took are here :

(prodigy-env) (base) gebruiker@xxxxxU:~/anaconda3/envs$ python -m prodigy train /home/xxxxx/Documenten/ --ner test_dataset --base-model en_core_web_lg
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
Killed
(prodigy-env) (base) gebruiker@xxxx:~/anaconda3/envs$

Can you try running with PRODIGY_LOGGING=verbose? Something like:

PRODIGY_LOGGING=verbose python -m ...

There are many possible reasons why the pipeline gets killed; it could be memory pressure or something else entirely. Here are a few ways I usually debug this:

  • How large is your dataset? Can you try updating the config.cfg file and lowering the batch size? This could be an out-of-memory kill.
  • It may also be a dependency error. What's the output of python -m prodigy stats -l and python -m spacy info?
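On the batch-size point: if it is a memory kill, the batcher settings in the generated config.cfg are the first place to look. A sketch of the relevant sections, assuming the default batch_by_words batcher that ships with standard spaCy v3 configs; your generated file may look different:

```ini
# Excerpt of config.cfg; lowering these values reduces peak memory
# during training, at the cost of slower or noisier updates.
[nlp]
batch_size = 64

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 500
tolerance = 0.2
discard_oversize = false
```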

Hi,

Here is the output with verbose logging:
(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ PRODIGY_LOGGING=verbose python -m prodigy train /home/gebruiker/Documenten/ --ner dataset,dataset_anon,dataset_combined --base-model en_core_web_lg
08:32:23: INIT: Setting all logging levels to 10
08:32:23: RECIPE: Calling recipe 'train'
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
08:32:30: CONFIG: Using config from global prodigy.json
/home/gebruiker/.prodigy/prodigy.json

08:32:30: DB: Initializing database SQLite
08:32:30: DB: Connecting to database SQLite
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
Killed

The database is 170 MB in size; I can open it in VS Code and in Google's SQLite viewer and see the datasets. The process uses the config.cfg file from the base model, en_core_web_lg, with batch size 256.

(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ python -m prodigy stats -l
Traceback (most recent call last):
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
self._state.set_connection(self._connect())
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
conn = sqlite3.connect(self.database, timeout=self._timeout,
sqlite3.OperationalError: unable to open database file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3195, in execute_sql
cursor = self.cursor(commit)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3179, in cursor
self.connect()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3136, in connect
self._initialize_connection(self._state.conn)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
reraise(new_type, new_type(exc_value, *exc_args), traceback)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
raise value.with_traceback(tb)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
self._state.set_connection(self._connect())
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
conn = sqlite3.connect(self.database, timeout=self._timeout,
peewee.OperationalError: unable to open database file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gebruiker/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gebruiker/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/__main__.py", line 61, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 364, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/plac_core.py", line 367, in __call__
cmd, result = parser.consume(arglist)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/recipes/commands.py", line 46, in stats
"total_datasets": len(DB.datasets),
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/components/db.py", line 236, in datasets
return [ds.name for ds in datasets]
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 7014, in __iter__
self.execute()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 1927, in inner
return method(self, database, *args, **kwargs)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 1998, in execute
return self._execute(database)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2171, in _execute
cursor = database.execute(self)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3210, in execute
return self.execute_sql(sql, params, commit=commit)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3204, in execute_sql
self.commit()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
reraise(new_type, new_type(exc_value, *exc_args), traceback)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
raise value.with_traceback(tb)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3195, in execute_sql
cursor = self.cursor(commit)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3179, in cursor
self.connect()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3136, in connect
self._initialize_connection(self._state.conn)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
reraise(new_type, new_type(exc_value, *exc_args), traceback)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
raise value.with_traceback(tb)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
self._state.set_connection(self._connect())
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
conn = sqlite3.connect(self.database, timeout=self._timeout,
peewee.OperationalError: unable to open database file
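As an aside: peewee's "unable to open database file" above almost always means SQLite could not open the file at the resolved path (wrong path, missing file, or missing read/write permission on the file or its directory), rather than a corrupt database. A small sketch of the checks involved, assuming the default ~/.prodigy location:

```python
import os
import sqlite3

def diagnose_db(db_path: str) -> str:
    """Return a short diagnosis for an SQLite file that won't open."""
    if not os.path.exists(db_path):
        return "missing: no file at this path"
    if not os.access(db_path, os.R_OK | os.W_OK):
        return "permissions: file is not readable/writable by this user"
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("SELECT name FROM sqlite_master LIMIT 1")
        conn.close()
    except sqlite3.Error as err:
        return f"sqlite error: {err}"
    return "ok: SQLite can open the file"

# e.g. diagnose_db(os.path.expanduser("~/.prodigy/prodigy.db"))
```

Note that SQLite also needs write access to the directory containing the file (for journal files), so checking the folder's permissions is worthwhile too.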

(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version 3.4.1
Location /home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/spacy
Platform Linux-5.15.0-46-generic-x86_64-with-glibc2.35
Python version 3.9.12
Pipelines en_core_web_lg (3.4.0)

I think it has to do with some dependency error.
Thanks for your help.
gr. Rahul

Hi @rahul1 ,

Upon checking the error message, perhaps what we can do instead is:

  • Export the datasets from your old prodigy.db using the db-out command.
  • This will produce a JSONL file that you can use for other downstream purposes; in our case, we'll use it to hydrate the new prodigy.db.

There may have been internal changes to the SQLite format across versions, so it can be hard to pinpoint the exact error. If you can still run Prodigy against the old database, you can follow the steps above.
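In case it helps, the export/import round trip could look like this, using the test_dataset name from earlier in the thread (substitute your own dataset names; python -m prodigy stats -l lists them):

```shell
# In the environment that can still read the old prodigy.db,
# export a dataset as JSONL (db-out writes to stdout by default):
python -m prodigy db-out test_dataset > ./test_dataset.jsonl

# Then, in the new environment with a fresh database,
# import the JSONL back in under the same dataset name:
python -m prodigy db-in test_dataset ./test_dataset.jsonl
```

The JSONL file also doubles as a plain-text backup of your annotations, independent of any Prodigy or SQLite version.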