How to reuse prodigy.db to retrain an old (spaCy v2) custom NER model

Hi,
Last year, in June 2021, I created a custom NER model with Prodigy (and spaCy 2.x) on my Windows laptop:
python -m prodigy train ner dataset,dataset_correct,dataset_correct1,dataset_correct3 en_vectors_web_lg --output C:\Users\myname\Documents\tmp_model --eval-split 0.2 --n-iter 40

I tried to upload this model to huggingface.co, but I could not, due to the incompatibility between spaCy v2 (used by the model) and spaCy v3 (required by spacy-huggingface-hub). I have therefore decided to install Prodigy on my Ubuntu 22.04 laptop to retrain the old model, or rebuild it, depending on what is possible.

I still have the .prodigy folder from my Windows laptop from last year. It contains two files: prodigy.db (168 MB, 9 datasets) and prodigy.json (6 B). I want to reuse this prodigy.db database to retrain or rebuild the old model.

Can you please give suggestions on how to do this, with links to the relevant code or documentation?

gr.
Rahul

Hi Rahul,

One way you can do this is to copy the prodigy.db file from your old laptop into your new environment. You'd usually find it in the Prodigy home directory. By default, prodigy.db is a plain SQLite database.
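After copying, a quick sanity check that the file is a readable SQLite database and still contains your datasets can save some debugging later. A small sketch; the table name "dataset" matches Prodigy's default SQLite schema, but that's an assumption worth confirming for your version:

```python
import sqlite3

def list_prodigy_datasets(db_path: str) -> list:
    """Return the dataset names stored in a Prodigy SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT name FROM dataset").fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

If this raises "unable to open database file" or returns nothing, the copy itself (path, permissions, truncated transfer) is the first thing to check.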

You can then train again using the prodigy train command (be sure to check the new parameters and arguments), and it will result in a spaCy v3 model (under the hood, prodigy train calls the same code as spacy train in v3). We highly recommend doing it this way so that it's easier to integrate with other services (e.g. Hugging Face), etc.
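For reference, the train syntax changed in the spaCy v3 releases of Prodigy (v1.11+): the output directory is now a positional argument and datasets are passed per component. A sketch using the dataset names from the original v2 command; check python -m prodigy train --help for the exact options in your installed version:

```shell
# Prodigy v1.11+ train syntax (spaCy v3 under the hood); the dataset names
# here are taken from the original v2 command and may differ in your database.
python -m prodigy train ./tmp_model \
    --ner dataset,dataset_correct,dataset_correct1,dataset_correct3 \
    --eval-split 0.2
```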

Hi,

Thank you for your suggestion. I copied the prodigy.db file from the old laptop into the .prodigy home directory of the new environment, and set the path in prodigy.json to '/home/gebruiker/.prodigy'.
I then used the train recipe on a dataset from the database, but the process gets killed right after initializing the pipeline.

Can you suggest what is happening here?

The steps I took are here :

(prodigy-env) (base) gebruiker@xxxxxU:~/anaconda3/envs$ python -m prodigy train /home/xxxxx/Documenten/ --ner test_dataset --base-model en_core_web_lg
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
Killed
(prodigy-env) (base) gebruiker@xxxx:~/anaconda3/envs$

Can you try running with PRODIGY_LOGGING=verbose? Something like:

PRODIGY_LOGGING=verbose python -m ...

There are many possible reasons why the pipeline gets killed; it could be memory pressure or something else entirely. Here are a few ways I usually debug this:

  • How large is your dataset? Can you try updating the config.cfg file and lowering the batch size? This could be an out-of-memory kill.
  • It may also be a dependency error. What's the output of python -m prodigy stats -l and python -m spacy info?
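On the batch-size point: if it is a memory kill, the batcher settings in the generated config.cfg are the first place to look. A sketch of the relevant sections, assuming the default batch_by_words batcher that ships with standard spaCy v3 configs; your generated file may look different:

```ini
# Excerpt of config.cfg; lowering these values reduces peak memory
# during training, at the cost of slower or noisier updates.
[nlp]
batch_size = 64

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 500
tolerance = 0.2
discard_oversize = false
```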

Hi,

Here is the output with verbose logging:
(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ PRODIGY_LOGGING=verbose python -m prodigy train /home/gebruiker/Documenten/ --ner dataset,dataset_anon,dataset_combined --base-model en_core_web_lg
08:32:23: INIT: Setting all logging levels to 10
08:32:23: RECIPE: Calling recipe 'train'
ℹ Using CPU

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
08:32:30: CONFIG: Using config from global prodigy.json
/home/gebruiker/.prodigy/prodigy.json

08:32:30: DB: Initializing database SQLite
08:32:30: DB: Connecting to database SQLite
ℹ Using config from base model
✔ Generated training config

=========================== Initializing pipeline ===========================
Killed

The database is 170 MB in size; I can open it in VS Code and in Google's SQLite viewer and see the datasets. The process uses the config.cfg file from the base model, en_core_web_lg, with batch size 256.

(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ python -m prodigy stats -l
Traceback (most recent call last):
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
self._state.set_connection(self._connect())
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
conn = sqlite3.connect(self.database, timeout=self._timeout,
sqlite3.OperationalError: unable to open database file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3195, in execute_sql
cursor = self.cursor(commit)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3179, in cursor
self.connect()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3136, in connect
self._initialize_connection(self._state.conn)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
reraise(new_type, new_type(exc_value, *exc_args), traceback)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
raise value.with_traceback(tb)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
self._state.set_connection(self._connect())
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
conn = sqlite3.connect(self.database, timeout=self._timeout,
peewee.OperationalError: unable to open database file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gebruiker/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gebruiker/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/__main__.py", line 61, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 364, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/plac_core.py", line 367, in __call__
cmd, result = parser.consume(arglist)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/recipes/commands.py", line 46, in stats
"total_datasets": len(DB.datasets),
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/prodigy/components/db.py", line 236, in datasets
return [ds.name for ds in datasets]
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 7014, in __iter__
self.execute()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 1927, in inner
return method(self, database, *args, **kwargs)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 1998, in execute
return self._execute(database)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2171, in _execute
cursor = database.execute(self)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3210, in execute
return self.execute_sql(sql, params, commit=commit)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3204, in execute_sql
self.commit()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
reraise(new_type, new_type(exc_value, *exc_args), traceback)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
raise value.with_traceback(tb)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3195, in execute_sql
cursor = self.cursor(commit)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3179, in cursor
self.connect()
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3136, in connect
self._initialize_connection(self._state.conn)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 2970, in __exit__
reraise(new_type, new_type(exc_value, *exc_args), traceback)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 191, in reraise
raise value.with_traceback(tb)
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3133, in connect
self._state.set_connection(self._connect())
File "/home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/peewee.py", line 3478, in _connect
conn = sqlite3.connect(self.database, timeout=self._timeout,
peewee.OperationalError: unable to open database file
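As an aside: peewee's "unable to open database file" above almost always means SQLite could not open the file at the resolved path (wrong path, missing file, or missing read/write permission on the file or its directory), rather than a corrupt database. A small sketch of the checks involved, assuming the default ~/.prodigy location:

```python
import os
import sqlite3

def diagnose_db(db_path: str) -> str:
    """Return a short diagnosis for an SQLite file that won't open."""
    if not os.path.exists(db_path):
        return "missing: no file at this path"
    if not os.access(db_path, os.R_OK | os.W_OK):
        return "permissions: file is not readable/writable by this user"
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("SELECT name FROM sqlite_master LIMIT 1")
        conn.close()
    except sqlite3.Error as err:
        return f"sqlite error: {err}"
    return "ok: SQLite can open the file"

# e.g. diagnose_db(os.path.expanduser("~/.prodigy/prodigy.db"))
```

Note that SQLite also needs write access to the directory containing the file (for journal files), so checking the folder's permissions is worthwhile too.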

(prodigy-env) (base) gebruiker@xxxxx:~/anaconda3/envs$ python -m spacy info

============================== Info about spaCy ==============================

spaCy version 3.4.1
Location /home/gebruiker/anaconda3/envs/prodigy-env/lib/python3.9/site-packages/spacy
Platform Linux-5.15.0-46-generic-x86_64-with-glibc2.35
Python version 3.9.12
Pipelines en_core_web_lg (3.4.0)

I think it has to do with some dependency error.
Thanks for your help.
gr. Rahul

Hi @rahul1 ,

Upon checking the error message, perhaps what we can do instead is:

  • Export the datasets from your old prodigy.db using the db-out command.
  • This will produce a JSONL file that you can use for other downstream purposes; in our case, we'll use it to hydrate the new prodigy.db.

There may have been internal changes to the SQLite format across versions, so it can be hard to pinpoint the exact error. If you can still run Prodigy against the old database, you can follow the steps above.
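In case it helps, the export/import round trip could look like this, using the test_dataset name from earlier in the thread (substitute your own dataset names; python -m prodigy stats -l lists them):

```shell
# In the environment that can still read the old prodigy.db,
# export a dataset as JSONL (db-out writes to stdout by default):
python -m prodigy db-out test_dataset > ./test_dataset.jsonl

# Then, in the new environment with a fresh database,
# import the JSONL back in under the same dataset name:
python -m prodigy db-in test_dataset ./test_dataset.jsonl
```

The JSONL file also doubles as a plain-text backup of your annotations, independent of any Prodigy or SQLite version.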