I am trying to export NER annotations in Cyrillic that were annotated by several annotators and then reviewed and saved in a separate dataset, but I keep getting the same Unicode-related error. I don’t have that issue if I export any single annotator’s work, so it seems that something went wrong during the review.
I’ve tried several things:
- data-to-spacy
prodigy data-to-spacy ./corpus --ner ner_news_final --eval-split 0.2 --base-model bg_model --lang bg --verbose
- db-out
prodigy db-out ner_news_final > ./ner_news_final.jsonl
- even printing returns the same error
prodigy print-dataset ner_news_final | less -r
This is the error for all attempts:
============================== Generating data ==============================
Components: ner
Merging training and evaluation data for 1 components
Traceback (most recent call last):
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/__main__.py", line 63, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 872, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/train.py", line 481, in data_to_spacy
    train_docs, dev_docs, pipes = merge_data(
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 139, in merge_data
    corpus = create_merged_corpus(**readers, eval_split=eval_split)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 857, in create_merged_corpus
    data[reader_name] = reader(
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 978, in read_ner_annotations
    examples, eval_examples = get_train_eval_examples(
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 927, in get_train_eval_examples
    examples = load_examples(DB, datasets)
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/recipes/data_utils.py", line 156, in load_examples
    examples = db.get_dataset_examples(set_id) or []
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/components/db.py", line 589, in get_dataset_examples
    examples = list(self.iter_dataset_examples(name, session=session))
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/components/db.py", line 612, in iter_dataset_examples
    yield eg.load()
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/prodigy/components/db.py", line 189, in load
    return cast(Dict[str, Any], srsly.json_loads(content))
  File "/Users/ivo/miniconda3/envs/prodigy_m1/lib/python3.9/site-packages/srsly/_json_api.py", line 39, in json_loads
    return ujson.loads(data)
ValueError: Unterminated unicode escape sequence when decoding 'string'
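If I’m reading the error right, it’s what ujson (which srsly uses under the hood) raises when a JSON string ends in the middle of a \uXXXX escape. This minimal snippet (my own reproduction, nothing from the actual dataset) gives the exact same message:

import srsly

# a JSON blob truncated in the middle of a unicode escape
broken = '{"text": "\\u041'
srsly.json_loads(broken)
# ValueError: Unterminated unicode escape sequence when decoding 'string'

So my guess is that one of the stored records got cut off mid-escape when the review dataset was saved.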
I also tried connecting to my DB directly; by iterating through the examples one at a time I can narrow it down to around the 32nd example in the dataset, where the same error is raised:
from prodigy.components.db import connect

db = connect(
    db_id="mysql",
    db_settings={
        "user": "#########",
        "password": "#########",
        "host": "#########",
        "port": "#########",
        "database": "#########",
        "ssl": {"ssl": {"ssl-ca": "certificate.crt.pem"}},
    },
)

# consume the generator one example at a time to see how far it gets
for i, eg in enumerate(db.iter_dataset_examples("ner_news_final")):
    print(i)
# fails with the same ValueError around i == 32
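If it helps, this is roughly how I’d try to inspect the raw stored value for the failing record, assuming the internal peewee models (Dataset, Example, Link) in prodigy/components/db.py can be queried directly after connect(); no idea if that’s actually supported:

from prodigy.components.db import Dataset, Example, Link

# assumption: these are the internal peewee models, usable after connect() as above
dataset = Dataset.get(Dataset.name == "ner_news_final")
examples = Example.select().join(Link).where(Link.dataset == dataset)
for eg in examples:
    try:
        eg.load()  # the call that fails in the traceback
    except ValueError:
        print(eg.id, eg.content[-60:])  # tail of the raw JSON blob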
I’m on a Mac M1 and have tried both arm64 and x86 environments. Prodigy is version 1.13.1.
Is there something I can do on my end to work around this issue?
If not, is there any way I can drop the first 32 or so examples from my dataset, save the rest into a new one, and export from there? I can’t figure out how to filter the datasets manually, and I was hoping to avoid going through the whole review process again (~2000 examples).
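For what it’s worth, this is the kind of filtering I had in mind, continuing from the model query above, again assuming add_dataset/add_examples are OK to use like this ("ner_news_final_fixed" is just a placeholder name):

# continuing from the connect() call and the model query above
good = []
for eg in examples:
    try:
        good.append(eg.load())
    except ValueError:
        continue  # skip the record(s) whose stored JSON won't decode

db.add_dataset("ner_news_final_fixed")  # placeholder name
db.add_examples(good, datasets=["ner_news_final_fixed"])

...and then run db-out on the new dataset. Would that be safe?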
Thanks,
Ivo