db-in error after db-out

Hi! I get the following error message as soon as I run db-in after db-out:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I created the dataset locally, exported it with db-out and wanted to import it into a VM with GPU support, e.g. to run train-curve there.
However, I also get the error message on my local machine when I try to import the dataset under a new name after exporting it.

Can I simply export and import datasets?

db-out:

python -m prodigy db-out correct_UC01_train > assets/correct_UC01_train.jsonl

db-in:

python -m prodigy db-in test_UC01 assets/correct_UC01_train.jsonl

Full error message:

Traceback (most recent call last):
  File "C:\Users\xxx\Miniconda3\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\xxx\Miniconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\xxx\Miniconda3\lib\site-packages\prodigy\__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\xxx\Miniconda3\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "C:\Users\xxx\Miniconda3\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\xxx\Miniconda3\lib\site-packages\prodigy\recipes\commands.py", line 152, in db_in
    annotations = [set_hashes(eg) for eg in annotations]
  File "C:\Users\xxx\Miniconda3\lib\site-packages\prodigy\recipes\commands.py", line 152, in <listcomp>
    annotations = [set_hashes(eg) for eg in annotations]
  File "cython_src\prodigy\components\loaders.pyx", line 140, in JSONL
  File "C:\Users\xxx\Miniconda3\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Local environment:

============================== ✨  Prodigy Stats ==============================

Version          1.11.4
Location         C:\Users\Miniconda3\lib\site-packages\prodigy
Prodigy Home     C:\Users\.prodigy
Platform         Windows-10-10.0.18362-SP0
Python Version   3.8.3
Database Name    SQLite
Database Id      sqlite
Total Datasets   9
Total Sessions   53

Thank you!


Hi! That's strange and I wonder if somehow the encoding changed when you transferred the file to your GPU machine? :thinking:

In general, we'd recommend using data-to-spacy to export your spaCy training corpus and then upload that to the machine you want to train with. (Or, if you're using spaCy v3 and want it to be super elegant, make it a spaCy project and use push and pull to upload to/download from a remote storage.) The advantage of exporting the training corpus is that your GPU machine won't have to depend on Prodigy, and you can make sure that the data you create and work with locally is always the same as the data you work with remotely.

Hi @ines! Thank you for the super quick reply!
I have the same problem on my local (Prodigy) machine. So just db-out and db-in -> error message.

The training on the GPU machine works fine. I only transfer the binary spaCy files after data-to-spacy for training and evaluation. I did everything with spaCy projects/W&B and it is so cool to work with. Thank you so much!

I was just thinking about running the train-curve on the remote machine as well. I don't want to annotate, just calculate. Therefore db-out and db-in.

That's very strange because we use the exact same method and configuration to read and write JSON within Prodigy – so I don't understand how a file written out could not be loaded back in :thinking:

I wonder if it's the > on Windows? Could you try specifying the output directory as the second argument?

python -m prodigy db-out correct_UC01_train assets
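
If it is the >, that would make sense: Windows PowerShell's > redirect writes UTF-16 LE with a byte order mark by default, and the first byte of that mark is exactly the 0xff the error complains about. You could verify by peeking at the first bytes of the exported file, something like this sketch (the path is just an example):

# check whether the exported file starts with a UTF-16 LE byte order mark,
# which is what Windows PowerShell's > redirect writes by default
with open("assets/correct_UC01_train.jsonl", "rb") as f:
    first_bytes = f.read(4)

print(first_bytes)
if first_bytes.startswith(b"\xff\xfe"):
    print("UTF-16 LE with BOM, not UTF-8")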

Another workaround would be to just call into Prodigy's database from Python and get a list of dictionaries that you can then save out yourself. This would also let you debug the process if it turns out you end up with a similar problem:

from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
examples = db.get_dataset("correct_UC01_train")  # list of annotation dicts
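
To save the examples out yourself, a minimal sketch, assuming srsly (which ships with Prodigy) and an example output path; srsly always writes UTF-8, so db-in can read the file back in:

import srsly

# write the examples as UTF-8 JSONL so db-in can load them again
srsly.write_jsonl("assets/correct_UC01_train.jsonl", examples)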

Good catch :blush: That works:

python -m prodigy db-out correct_UC01_train assets

It really looks like the problem is the > on Windows. As soon as I test it with >, the import fails.
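
For reference, a file that was already exported with > should also become importable again after re-encoding it to UTF-8. An untested sketch, assuming PowerShell wrote the file as UTF-16, with example paths:

from pathlib import Path

# read the PowerShell-redirected file as UTF-16 and write it back out as UTF-8
text = Path("assets/correct_UC01_train.jsonl").read_text(encoding="utf-16")
Path("assets/correct_UC01_train_utf8.jsonl").write_text(text, encoding="utf-8")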

Thank you for the great support! :heart:


Hi,
I had the same problem, which was solved by avoiding ">" in Windows PowerShell. But Prodigy fails when I want to run textcat.teach on my imported dataset:

python -m prodigy db-out email_labels .
prodigy db-in skade_email_annotation ./email_labels.jsonl
prodigy textcat.teach skade_email_annotation ".\nb_core_news_sm\nb_core_news_sm-3.2.0" - --label "Economy, Technology"

This results in the following error message:

✘ Failed to load task (invalid JSON on line 1)
This error pretty much always means that there's something wrong with this line
of JSON and Python can't load it. Even if you think it's correct, something must
confuse it. Try calling json.loads(line) on each line or use a JSON linter.

The top of the file looks like this:

{"text":"This is about money","label":"ECONOMY","_input_hash":1680477043,"_task_hash":-1989002408,"options":[{"id":"TECHNOLOGY","text":"TECHNOLOGY"},{"id":"ECONOMY","text":"ECONOMY"}],"_view_id":"choice","accept":["TECHNOLOGY","ECONOMY"],"config":{"choice_style":"multiple"},"answer":"ignore","_timestamp":1643617059}
{"text":"This is about robot science","label":"TECHNOLOGY","_input_hash":2660767,"_task_hash":-1128305112,"options":[{"id":"TECHNOLOGY","text":"TECHNOLOGY"},{"id":"ECONOMY","text":"ECONOMY"}],"_view_id":"choice","config":{"choice_style":"multiple"},"accept":[],"answer":"accept","_timestamp":1643617063}

Could there be some other Windows PowerShell issues that I don't know about, or do you have another suggestion as to why this doesn't work?
Thanks,
Anders

Hi @Lingo ,

It seems that this command:

prodigy textcat.teach skade_email_annotation ".\nb_core_news_sm\nb_core_news_sm-3.2.0" - --label "Economy, Technology"

loads from standard input (that's what the - source means), but we're not streaming anything to it. Perhaps we can explicitly set the source path to .\email_labels.jsonl instead? Something like this (note the - vs adding the actual path):

prodigy textcat.teach skade_email_annotation ".\some-model" ".\email_labels.jsonl" ...