db-out producing json files with extra spaces between each character...

hodsonjames · December 22, 2019, 9:31pm

Hi,

I created a new annotated dataset via the following command:

python -m prodigy ner.manual company_ned blank:en .\all_companies.txt --label ORG,LOCATION,DEPARTMENT,DIVISION,PRODUCT,BRAND,POSITION,LEGAL,OTHER

I annotated about 200 of these, then I wanted to add a couple of labels, leverage the baseline of the spaCy NER model, and go back over those initial annotations to add extra tags where needed:

python -m prodigy ner.correct company_ned en_core_web_sm .\all_companies.txt --label ORG,LOCATION,DEPARTMENT,DIVISION,PRODUCT,BRAND,POSITION,LEGAL,OTHER,TYPE,SUBSIDIARY

This didn't work--the framework refused to revisit the original annotations to correct them. I did some reading and figured out that I am supposed to output the labels so far, and then read them back in as an input data set in order to revisit them. Ok, so then I did:

python -m prodigy db-out company_ned > company_ned_1.jsonl

The JSON Lines file produced can't be pasted here, as it contains interspersed non-printable 'space' characters. However, when trying to load it into Prodigy, I get the following error:

Task exception was never retrieved
future: <Task finished coro=<RequestResponseCycle.run_asgi() done, defined at C:\Users\james\venv\science\science\lib\site-packages\uvicorn\protocols\http\h11_impl.py:383> exception=UnicodeDecodeError('utf-8', b'\xff\xfe{\x00"\x00t\x00e\x00x\x00t\x00"\x00:\x00"\x00S\x00S\x00C\x00E\x00T\x00,\x00 \x00B\x00h\x00i\x00l\x00a\x00i\x00"\x00,\x00"\x00_\x00i\x00n\x00p\x00u\x00t\x00_\x00h\x00a\x00s\x00h\x00"\x00:\x00-\x006\x006\x002\x000\x003\x003\x000\x007\x007\x00,\x00"\x00_\x00t\x00a\x00s\x00k\x00_\x00h\x00a\x00s\x00h\x00"\x00:\x00-\x001\x002\x003\x002\x007\x002\x007\x003\x006\x00,\x00"\x00t\x00o\x00k\x00e\x00n\x00s\x00"\x00:\x00[\x00{\x00"\x00t\x00e\x00x\x00t\x00"\x00:\x00"\x00S\x00S\x00C\x00E\x00T\x00"\x00,\x00"\x00s\x00t\x00a\x00r\x00t\x00"\x00:\x000\x00,\x00"\x00e\x00n\x00d\x00"\x00:\x005\x00,\x00"\x00i\x00d\x00"\x00:\x000\x00}\x00,\x00{\x00"\x00t\x00e\x00x\x00t\x00" ... ... ...

Any ideas what might have happened?

Thanks!

hodsonjames · December 22, 2019, 9:36pm

A quick update: removing '\x00' characters from the file via regex replacement (e.g. sed) makes everything work correctly...
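For anyone who hits the same thing: the bytes in the traceback start with \xff\xfe, which is the UTF-16 little-endian byte order mark, so the file looks like UTF-16 rather than UTF-8 with stray characters. Stripping \x00 works for plain ASCII data, but it would corrupt any non-ASCII text, so decoding the file as UTF-16 and writing it back as UTF-8 is probably the safer repair. Below is a minimal sketch of that idea; the function name and the file names are made up for illustration and are not part of Prodigy.

```python
import codecs
from pathlib import Path


def reencode_utf16_to_utf8(src, dst):
    """Rewrite a UTF-16 file as UTF-8 (hypothetical helper, not a Prodigy API)."""
    raw = Path(src).read_bytes()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the byte order mark and drops it.
        text = raw.decode("utf-16")
    else:
        # No UTF-16 byte order mark: assume the file is already UTF-8.
        text = raw.decode("utf-8")
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Example file names only; adjust to your own export.
    if Path("company_ned_1.jsonl").exists():
        reencode_utf16_to_utf8("company_ned_1.jsonl", "company_ned_1_utf8.jsonl")
```

The converted file should then load like any other UTF-8 JSONL file.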

ines · December 22, 2019, 10:44pm

Thanks for the report – that's strange, I've never seen this error before.

The db-out command doesn't really do anything fancy, it mostly just dumps the JSON and writes it out. I wonder if the problem you're seeing has something to do with how the data is written to stdout and then redirected to a file.

What happens if you use db-out with the --out-dir argument instead of redirecting to a file? For example, the following will create a file company_ned.jsonl in the current directory:

python -m prodigy db-out company_ned --out-dir ./

hodsonjames · December 23, 2019, 12:14am

You're absolutely right--it's the redirect in PowerShell.

If I just let the output print to stdout without the redirect, it's fine. But redirecting to a file introduces \x00 after every character. This is my first month trying to switch to Windows as a dev environment after 20 years in Linux, and things like this make me really think twice...

Thanks for the great remote debugging!
James
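To spell out why the redirect produces that pattern: as far as I know, the > operator in Windows PowerShell writes files as UTF-16 LE with a byte order mark by default, and in UTF-16 LE every ASCII character takes two bytes, the second being \x00. The short sketch below reproduces the first bytes from the traceback and shows why a UTF-8 reader rejects them. The helper name is hypothetical, and the sample line is taken from the error message above.

```python
import codecs


def utf16_le_with_bom(text):
    """Encode text as UTF-16 LE with a byte order mark (illustrative helper)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


raw = utf16_le_with_bom('{"text":"SSCET, Bhilai"}')

# Each ASCII character is followed by a \x00 byte, after the \xff\xfe mark.
print(raw[:8])  # b'\xff\xfe{\x00"\x00t\x00'

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # \xff can never start a UTF-8 sequence, so decoding fails at byte 0.
    print("UTF-8 decode failed at byte", err.start)

# Decoding with the matching codec recovers the original line.
print(raw.decode("utf-16"))
```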

ines · December 23, 2019, 12:53pm

Thanks and yeah, this is really good to know!

We should probably adjust the examples in the docs then to not confuse Windows users. I like allowing redirecting the output to a file because it's very "native" and simple, but the official alternative on PowerShell (apparently like this, just with UTF-8) is pretty unwieldy and I'm not sure I want to include this as a recommendation in the docs.
