Trained model not working on new dataset

Hello, I'm really struggling to understand what is wrong with our setup, and any help is appreciated.

Background
We have two machines set up: one is used by analysts to annotate a dataset, and the other is used by our data scientist to train the models. We keep them separate due to resource constraints.

The data scientist has been using an annotated dataset from the Analyst machine and trained an NER model that seems to perform fairly well, and we want to use it. We transferred the model from the Datascience machine to the Analyst machine so we could run ner.correct with it on a new dataset that we just pulled.

However, the new dataset and the new model do not seem to like each other at all. The new model works with the old dataset, and the new dataset works with the old model, but we cannot get this new model to work with the new dataset at all.

ubuntu@ip-xxx-xxx-xxx-xxx:~/xxxx$ PRODIGY_LOGGING=verbose prodigy ner.correct ner_regs_files ./output/model-best/ files_oct27_export.jsonl --label ISSUING_AUTHORITY,JURISDICTION,DATE,CANNABIS,LEGAL,ADDRESS,HEMP
14:06:46: INIT: Setting all logging levels to 10
14:06:46: RECIPE: Calling recipe 'ner.correct'
Using 7 label(s): ISSUING_AUTHORITY, JURISDICTION, DATE, CANNABIS, LEGAL,
ADDRESS, HEMP
14:06:46: RECIPE: Starting recipe ner.correct
{'dataset': 'ner_regs_files', 'source': 'files_oct27_export.jsonl', 'loader': None, 'label': ['ISSUING_AUTHORITY', 'JURISDICTION', 'DATE', 'CANNABIS', 'LEGAL', 'ADDRESS', 'HEMP'], 'update': False, 'exclude': None, 'unsegmented': False, 'component': 'ner', 'spacy_model': './output/model-best/'}
14:06:53: RECIPE: Annotating with 7 labels
['ISSUING_AUTHORITY', 'JURISDICTION', 'DATE', 'CANNABIS', 'LEGAL', 'ADDRESS', 'HEMP']
14:06:53: LOADER: Using file extension 'jsonl' to find loader
files_oct27_export.jsonl
14:06:53: LOADER: Loading stream from jsonl
14:06:53: LOADER: Rehashing stream
14:06:53: CONFIG: Using config from global prodigy.json
/home/ubuntu/.prodigy/prodigy.json
14:06:53: CONFIG: Using config from working dir
/home/ubuntu/xxxx/prodigy.json
14:06:53: VALIDATE: Validating components returned by recipe
14:06:53: CONTROLLER: Initialising from recipe
{'before_db': None, 'config': {'lang': 'en', 'labels': ['ISSUING_AUTHORITY', 'JURISDICTION', 'DATE', 'CANNABIS', 'LEGAL', 'ADDRESS', 'HEMP'], 'exclude_by': 'input', 'auto_count_stream': True, 'dataset': 'ner_regs_files', 'recipe_name': 'ner.correct', 'host': 'REDACTED-IP-ADDRESS', 'port': 8080, 'total_examples_target': 1000}, 'dataset': 'ner_regs_files', 'db': True, 'exclude': None, 'get_session_id': None, 'metrics': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f461ad67730>, 'self': <prodigy.core.Controller object at 0x7f4619f17190>, 'stream': <generator object correct.<locals>.make_tasks at 0x7f4618eca2e0>, 'update': None, 'validate_answer': None, 'view_id': 'ner_manual'}
14:06:53: VALIDATE: Creating validator for view ID 'ner_manual'
14:06:53: VALIDATE: Validating Prodigy and recipe config
14:06:53: CONFIG: Using config from global prodigy.json
/home/ubuntu/.prodigy/prodigy.json
14:06:53: CONFIG: Using config from working dir
/home/ubuntu/xxx/prodigy.json
14:06:53: DB: Initializing database SQLite
14:06:53: DB: Connecting to database SQLite
14:06:53: DB: Creating dataset '2022-10-28_14-06-53'
{'created': datetime.datetime(2022, 9, 30, 17, 38, 25)}
14:06:53: FEED: Initializing from controller
{'auto_count_stream': False, 'batch_size': 10, 'dataset': 'ner_regs_files', 'db': <prodigy.components.db.Database object at 0x7f4619f01c10>, 'exclude': ['ner_regs_files'], 'exclude_by': 'input', 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.Feed object at 0x7f4611936b50>, 'stream': <generator object correct.<locals>.make_tasks at 0x7f4618eca2e0>, 'target_total_annotated': 1000, 'timeout_seconds': 3600, 'total_annotated': 1204, 'total_annotated_by_session': Counter({'ner_regs_files-sesh1': 498, 'ner_regs_files-sesh2': 271, 'ner_regs_files-sesh3': 184, 'ner_regs_files-sesh4': 130, 'ner_regs_files-sesh5': 88, 'ner_regs_files-sesh6': 25, 'ner_regs_files-sesh7': 1}), 'validator': <prodigy.components.validate.Validator object at 0x7f467857e0d0>, 'view_id': 'ner_manual'}
14:06:53: PREPROCESS: Tokenizing examples (running tokenizer only)
14:06:53: PREPROCESS: Splitting sentences
{'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x7f4619f01c70>, 'no_sents_warned': False, 'stream': <generator object at 0x7f4618f3ad60>, 'text_key': 'text'}
14:06:53: CONFIG: Using config from global prodigy.json
/home/ubuntu/.prodigy/prodigy.json
14:06:53: CONFIG: Using config from working dir
/home/ubuntu/xxx/prodigy.json
14:06:53: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f4618f3ac20>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f461ad671f0>>, 'warn_threshold': 0.4}
14:06:53: FILTER: Filtering out empty examples for key 'text'
Killed

It seems to get killed at the very end, but I can't figure out what is happening or why. I've even reduced the new dataset to a single object so it would only load one small text extract, but it still doesn't work.

I am, unfortunately, not a data scientist myself and am mostly involved with the data pipeline side of things. The new dataset file looks pretty much identical to the old one, just with new data. I can't tell from the logging where the issue is happening, what to change to resolve it, or even where to look. It doesn't look like a hardware constraint either.

hi @WKamptner!

Thanks for your question and the background.

This seems a bit tricky, as nothing stands out right now. Maybe we should take a step back and make sure both machines have identical setups: are you using virtual environments on both machines, and the same Python versions?

You can run pip freeze on both machines to print Python library versions.
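If it's easier to script than to eyeball two pip freeze outputs, here's a rough Python equivalent you could run on both machines and diff; it just lists every installed distribution with its version (nothing Prodigy-specific, standard library only):

import importlib.metadata as md

# rough equivalent of `pip freeze`: print every installed package and its version
for dist in sorted(md.distributions(), key=lambda d: (d.metadata["Name"] or "").lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")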

Related, can you confirm you have the same version of prodigy on both machines?

You can check by running:

python -m prodigy stats

Also check your spaCy version too:

python -m spacy info

If they are consistent, can you add python -m to the start of all commands on both machines, like I did above (e.g., python -m prodigy ner.correct ...)? If python -m doesn't work, you may need to run python3 -m. You can also set up an alias.

Also confirm you have the same (or at least consistent) prodigy.json files on both machines.
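If it helps, here's a small sketch that prints both prodigy.json files your verbose log says are being read (the global one in ~/.prodigy and the one in the working directory); run it from the same directory you start Prodigy in, on each machine, and compare the output:

import json
from pathlib import Path

# the two config locations shown in the CONFIG lines of the verbose log
for path in (Path.home() / ".prodigy" / "prodigy.json", Path("prodigy.json")):
    if path.exists():
        print(path)
        print(json.dumps(json.loads(path.read_text()), indent=2))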

Since you have an existing model, you may want to check which version of spaCy was used to develop it. You can find it under "spacy_version" in the meta.json file in your model's folder.
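For example, something like this would print it; the path is the one from your ner.correct command above, so adjust it if the pipeline lives somewhere else:

import json
from pathlib import Path

# read the exported pipeline's meta.json and print the spaCy version constraint
meta = json.loads(Path("./output/model-best/meta.json").read_text())
print(meta.get("spacy_version"))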

If you can confirm all versions are consistent, then I would recommend testing on a common dataset. Without your data and access to an identical machine, we can't replicate the issue, so to iron out problems it's best to have a replicable dataset.

Hope this helps and let us know when you have an update!

Analyst Machine

ubuntu@xxxxxxxxx:~/fyllo-prodigy$ python -m prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.11.8                        
Location         /home/ubuntu/anaconda3/lib/python3.9/site-packages/prodigy
Prodigy Home     /home/ubuntu/.prodigy         
Platform         Linux-5.15.0-1022-aws-x86_64-with-glibc2.35
Python Version   3.9.12                        
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   11                            
Total Sessions   104 

Datascientist Machine

(base) ubuntu@xxxxxxxxx:~$ python -m prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.11.8                        
Location         /home/ubuntu/anaconda3/lib/python3.9/site-packages/prodigy
Prodigy Home     /home/ubuntu/.prodigy         
Platform         Linux-5.15.0-1022-aws-x86_64-with-glibc2.31
Python Version   3.9.13                        
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   3                             
Total Sessions   3  

Both machines are running spaCy version 3.4.2.

Both prodigy.json files only contain a host to point towards (we are running on EC2 instances), so I'm assuming all other undeclared values are defaults.

Both meta.json files say

spacy_version:">=3.4.2,<3.5.0"

for their spacy_version.

I am not part of the data science team, so I am not sure what they may have done on the model training machine, but are there any steps they might have taken there that wouldn't have been done on the Analyst machine (which does no model training)? To move the model over, we just ssh'd from the Datascience machine to the Analyst machine and recursively copied over specific folders. Is there any chance that, say, a hidden folder contains important information? I will look into testing on a common dataset and see what happens. The only other thing I can imagine is that the Datascience machine has a dedicated GPU that was used for training, though I was told the GPU was not necessary for the model they were using.

Not that I'm aware of.

In theory, moving the full folder for the spaCy pipeline should work on any machine. What's weird, though, is that if some file were missing when loading the model, you'd get more of an error than just Killed.

For example, if there's a problem with the meta.json file in a spaCy pipeline, spacy.load fails with an explicit error rather than the process just being killed.
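Here's a quick way to see that for yourself (the path below is a hypothetical broken pipeline folder, e.g. a copy of your model with meta.json renamed):

import spacy

# loading a pipeline folder whose meta.json is missing or malformed raises an
# explicit error instead of silently dying, so a bare "Killed" points elsewhere
try:
    nlp = spacy.load("path/to/pipeline-with-broken-meta")
except Exception as err:
    print(type(err).__name__, err)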

Can you check if your "Datascientist" machine is running into any memory problems?

If not, on the "Datascientist" machine, let's forget Prodigy for a second and see if you can load/run a sample sentence:

import spacy
nlp = spacy.load("path/to/pipeline")
doc = nlp("This is a sample sentence")

# confirm you can run your model on the Datascientist machine
for entity in doc.ents:
    print(entity.text, entity.label_)

GPUs could be another factor! Was one used to train the model, and/or did you use spacy-transformers anywhere in your workflow?
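One quick way to check, assuming the pipeline loads at all on that machine (the path is the one from your ner.correct command):

import spacy

# a transformer-based pipeline will list a "transformer" component here,
# whereas a CPU-friendly one typically starts with "tok2vec"
nlp = spacy.load("./output/model-best/")
print(nlp.pipe_names)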

GPUs can yield more accurate models but can also be trickier to handle. Here's a FAQ on GPUs in spaCy.
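And if you want to confirm whether spaCy can even see a GPU on each machine, this one-liner will tell you; it returns True if a GPU was found and activated, and False if spaCy falls back to CPU:

import spacy

# True if spaCy found and activated a GPU, False if it fell back to CPU
print(spacy.prefer_gpu())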