db-out producing json files with extra spaces between each character...


I created a new annotated dataset via the following command:

python -m prodigy ner.manual company_ned blank:en .\all_companies.txt --label ORG,LOCATION,DEPARTMENT,DIVISION,PRODUCT,BRAND,POSITION,LEGAL,OTHER

I annotated about 200 of these, then I wanted to add a couple of labels, leverage the baseline spaCy NER model, and go back over those initial annotations to add the extra tags where needed:

python -m prodigy ner.correct company_ned en_core_web_sm .\all_companies.txt --label ORG,LOCATION,DEPARTMENT,DIVISION,PRODUCT,BRAND,POSITION,LEGAL,OTHER,TYPE,SUBSIDIARY

This didn't work--the framework refused to revisit the original annotations so I could correct them. I did some reading and figured out that I'm supposed to export the annotations so far, then read them back in as an input dataset in order to revisit them. OK, so then I did:

python -m prodigy db-out company_ned > company_ned_1.jsonl

The JSONL file produced can't be pasted here, as it contains interspersed non-printable 'space' characters. However, when trying to load it into Prodigy, I get the following error:

Task exception was never retrieved
future: <Task finished coro=<RequestResponseCycle.run_asgi() done, defined at C:\Users\james\venv\science\science\lib\site-packages\uvicorn\protocols\http\h11_impl.py:383> exception=UnicodeDecodeError('utf-8', b'\xff\xfe{\x00"\x00t\x00e\x00x\x00t\x00"\x00:\x00"\x00S\x00S\x00C\x00E\x00T\x00,\x00 \x00B\x00h\x00i\x00l\x00a\x00i\x00"\x00,\x00"\x00_\x00i\x00n\x00p\x00u\x00t\x00_\x00h\x00a\x00s\x00h\x00"\x00:\x00-\x006\x006\x002\x000\x003\x003\x000\x007\x007\x00,\x00"\x00_\x00t\x00a\x00s\x00k\x00_\x00h\x00a\x00s\x00h\x00"\x00:\x00-\x001\x002\x003\x002\x007\x002\x007\x003\x006\x00,\x00"\x00t\x00o\x00k\x00e\x00n\x00s\x00"\x00:\x00[\x00{\x00"\x00t\x00e\x00x\x00t\x00"\x00:\x00"\x00S\x00S\x00C\x00E\x00T\x00"\x00,\x00"\x00s\x00t\x00a\x00r\x00t\x00"\x00:\x000\x00,\x00"\x00e\x00n\x00d\x00"\x00:\x005\x00,\x00"\x00i\x00d\x00"\x00:\x000\x00}\x00,\x00{\x00"\x00t\x00e\x00x\x00t\x00" ... ... ...

Any ideas what might have happened?


A quick update: removing '\x00' characters from the file via regex replacement (e.g. sed) makes everything work correctly...
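In case it helps anyone else, here's a Python sketch that re-encodes the file instead of stripping NULs (filenames as in the commands above). Reading with the utf-16 codec also consumes the BOM, which a plain \x00 replacement can leave behind:

```python
def reencode_utf16_to_utf8(src, dst):
    # Read the redirected file as UTF-16 (the codec strips the BOM
    # and the interleaved NUL bytes) and rewrite it as UTF-8.
    with open(src, encoding="utf-16") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)

# usage: reencode_utf16_to_utf8("company_ned_1.jsonl", "company_ned_1_utf8.jsonl")
```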

Thanks for the report – that's strange, I've never seen this error before :thinking:

The db-out command doesn't really do anything fancy; it mostly just serializes the JSON and writes it out. I wonder if the problem you're seeing has something to do with how the data is written to stdout and then redirected to a file.

What happens if you use db-out with the --out-dir argument instead of redirecting to a file? For example, the following will create a file company_ned.jsonl in the current directory:

python -m prodigy db-out company_ned --out-dir ./

You're absolutely right--it's the redirect in PowerShell.

If I just let the output print to stdout without the redirect, it's fine. But redirecting to a file introduces a \x00 after every character. This is my first month trying to switch to Windows as a dev environment after 20 years on Linux, and things like this really make me think twice...
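The pattern makes sense in hindsight: in Windows PowerShell 5.x, > is effectively Out-File, whose default encoding is UTF-16 LE ("Unicode"), so every ASCII character gets a trailing NUL byte. A minimal demonstration, nothing Prodigy-specific:

```python
text = '{"text":"SSCET, Bhilai"}'
encoded = text.encode("utf-16-le")  # what the redirect effectively wrote

# every ASCII character becomes <char>\x00 in UTF-16 LE
assert encoded[1::2] == b"\x00" * len(text)
print(encoded[:16])  # b'{\x00"\x00t\x00e\x00x\x00t\x00"\x00:\x00'
```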

Thanks for the great remote debugging!

Thanks and yeah, this is really good to know!

We should probably adjust the examples in the docs so they don't confuse Windows users. I like allowing the output to be redirected to a file because it's very "native" and simple, but the official alternative in PowerShell (apparently like this, just with UTF-8 :face_with_monocle:) is pretty unwieldy and I'm not sure I want to include it as a recommendation in the docs.