Hello, I'm really struggling to understand what is wrong with our setup, and any help is appreciated.
Background
We have 2 machines set up: one is used by analysts to annotate datasets, and the other is used by our data scientist to train models. They are kept separate due to resource constraints.
The data scientist used an annotated dataset from the analyst machine to train an NER model that seems to perform fairly well, and we want to put it to use. We transferred the model from the data science machine to the analyst machine so we could run ner.correct with it on a new dataset we just pulled.
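In case it matters, the transferred pipeline itself can be sanity-checked outside of Prodigy along these lines (this is just my own check; the sample sentence is made up, the path is the same one I pass to ner.correct):

import spacy

# Load the copied pipeline directory and run it on a throwaway sentence to
# confirm the transfer itself is intact and the NER component is present.
nlp = spacy.load("./output/model-best/")
doc = nlp("The Department of Health issued hemp licensing rules on October 27, 2022.")
print(nlp.pipe_names)
print([(ent.text, ent.label_) for ent in doc.ents])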
However, the new dataset and the new model do not seem to like each other at all. The new model works with the old dataset, and the new dataset works with the old model, but we cannot get the new model to work with the new dataset.
ubuntu@ip-xxx-xxx-xxx-xxx:~/xxxx$ PRODIGY_LOGGING=verbose prodigy ner.correct ner_regs_files ./output/model-best/ files_oct27_export.jsonl --label ISSUING_AUTHORITY,JURISDICTION,DATE,CANNABIS,LEGAL,ADDRESS,HEMP
14:06:46: INIT: Setting all logging levels to 10
14:06:46: RECIPE: Calling recipe 'ner.correct'
Using 7 label(s): ISSUING_AUTHORITY, JURISDICTION, DATE, CANNABIS, LEGAL,
ADDRESS, HEMP
14:06:46: RECIPE: Starting recipe ner.correct
{'dataset': 'ner_regs_files', 'source': 'files_oct27_export.jsonl', 'loader': None, 'label': ['ISSUING_AUTHORITY', 'JURISDICTION', 'DATE', 'CANNABIS', 'LEGAL', 'ADDRESS', 'HEMP'], 'update': False, 'exclude': None, 'unsegmented': False, 'component': 'ner', 'spacy_model': './output/model-best/'}
14:06:53: RECIPE: Annotating with 7 labels
['ISSUING_AUTHORITY', 'JURISDICTION', 'DATE', 'CANNABIS', 'LEGAL', 'ADDRESS', 'HEMP']
14:06:53: LOADER: Using file extension 'jsonl' to find loader
files_oct27_export.jsonl
14:06:53: LOADER: Loading stream from jsonl
14:06:53: LOADER: Rehashing stream
14:06:53: CONFIG: Using config from global prodigy.json
/home/ubuntu/.prodigy/prodigy.json
14:06:53: CONFIG: Using config from working dir
/home/ubuntu/xxxx/prodigy.json
14:06:53: VALIDATE: Validating components returned by recipe
14:06:53: CONTROLLER: Initialising from recipe
{'before_db': None, 'config': {'lang': 'en', 'labels': ['ISSUING_AUTHORITY', 'JURISDICTION', 'DATE', 'CANNABIS', 'LEGAL', 'ADDRESS', 'HEMP'], 'exclude_by': 'input', 'auto_count_stream': True, 'dataset': 'ner_regs_files', 'recipe_name': 'ner.correct', 'host': 'REDACTED-IP-ADDRESS', 'port': 8080, 'total_examples_target': 1000}, 'dataset': 'ner_regs_files', 'db': True, 'exclude': None, 'get_session_id': None, 'metrics': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f461ad67730>, 'self': <prodigy.core.Controller object at 0x7f4619f17190>, 'stream': <generator object correct.<locals>.make_tasks at 0x7f4618eca2e0>, 'update': None, 'validate_answer': None, 'view_id': 'ner_manual'}
14:06:53: VALIDATE: Creating validator for view ID 'ner_manual'
14:06:53: VALIDATE: Validating Prodigy and recipe config
14:06:53: CONFIG: Using config from global prodigy.json
/home/ubuntu/.prodigy/prodigy.json
14:06:53: CONFIG: Using config from working dir
/home/ubuntu/xxx/prodigy.json
14:06:53: DB: Initializing database SQLite
14:06:53: DB: Connecting to database SQLite
14:06:53: DB: Creating dataset '2022-10-28_14-06-53'
{'created': datetime.datetime(2022, 9, 30, 17, 38, 25)}
14:06:53: FEED: Initializing from controller
{'auto_count_stream': False, 'batch_size': 10, 'dataset': 'ner_regs_files', 'db': <prodigy.components.db.Database object at 0x7f4619f01c10>, 'exclude': ['ner_regs_files'], 'exclude_by': 'input', 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.Feed object at 0x7f4611936b50>, 'stream': <generator object correct.<locals>.make_tasks at 0x7f4618eca2e0>, 'target_total_annotated': 1000, 'timeout_seconds': 3600, 'total_annotated': 1204, 'total_annotated_by_session': Counter({'ner_regs_files-sesh1': 498, 'ner_regs_files-sesh2': 271, 'ner_regs_files-sesh3': 184, 'ner_regs_files-sesh4': 130, 'ner_regs_files-sesh5': 88, 'ner_regs_files-sesh6': 25, 'ner_regs_files-sesh7': 1}), 'validator': <prodigy.components.validate.Validator object at 0x7f467857e0d0>, 'view_id': 'ner_manual'}
14:06:53: PREPROCESS: Tokenizing examples (running tokenizer only)
14:06:53: PREPROCESS: Splitting sentences
{'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x7f4619f01c70>, 'no_sents_warned': False, 'stream': <generator object at 0x7f4618f3ad60>, 'text_key': 'text'}
14:06:53: CONFIG: Using config from global prodigy.json
/home/ubuntu/.prodigy/prodigy.json
14:06:53: CONFIG: Using config from working dir
/home/ubuntu/xxx/prodigy.json
14:06:53: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f4618f3ac20>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f461ad671f0>>, 'warn_threshold': 0.4}
14:06:53: FILTER: Filtering out empty examples for key 'text'
Killed
It seems to get hung up at the very end, and I can't figure out what is happening or why. I've even reduced the new dataset to a single object so it would only load one small text extract, but it still doesn't work.
I am, unfortunately, not a data scientist myself; I'm mostly involved with the data pipeline side of things. The new dataset file looks to be structured identically to the old one, just with new data. I can't read enough from the logging to tell where the issue is happening, what I could do to resolve it, where to look, or what to change to make this work. It doesn't look like a hardware constraint either.
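For what it's worth, this is the kind of check I can run on the new export to rule out obvious formatting problems; it's just my own sketch, not anything from Prodigy:

import json

# Confirm every line of the export is valid JSON with a non-empty "text" field,
# and report the longest text in case a single record is unusually large.
path = "files_oct27_export.jsonl"
count = 0
longest = 0
with open(path, encoding="utf8") as f:
    for i, line in enumerate(f, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)      # raises ValueError if the line is not valid JSON
        text = record.get("text") or ""
        if not text:
            print(f"line {i}: missing or empty 'text'")
        count += 1
        longest = max(longest, len(text))
print(f"checked {count} records, longest text: {longest} characters")

If there is something else I should be checking in the export or the model directory, I'd appreciate a pointer.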