Hello,
I receive my data as a CSV file.
Sometimes this data contains byte 0x93 (i.e., a non-UTF-8 quotation mark), which throws the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\site-packages\prodigy\__main__.py", line 53, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src\prodigy\core.pyx", line 353, in prodigy.core._components_to_ctrl
  File "cython_src\prodigy\core.pyx", line 142, in prodigy.core.Controller.__init__
  File "cython_src\prodigy\components\feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.__init__
  File "cython_src\prodigy\components\feeds.pyx", line 155, in prodigy.components.feeds.SharedFeed.validate_stream
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\site-packages\toolz\itertoolz.py", line 376, in first
    return next(iter(seq))
  File "cython_src\prodigy\components\preprocess.pyx", line 128, in add_tokens
  File "cython_src\prodigy\components\filters.pyx", line 37, in filter_duplicates
  File "cython_src\prodigy\components\filters.pyx", line 13, in filter_empty
  File "cython_src\prodigy\components\loaders.pyx", line 180, in CSV
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\csv.py", line 110, in __next__
    self.fieldnames
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\csv.py", line 97, in fieldnames
    self._fieldnames = next(self.reader)
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
**UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 170: invalid start byte**
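For context, the error can be reproduced outside Prodigy. This is only a minimal sketch, and it assumes the original file is Windows-1252 (cp1252) encoded, which is a guess based on the 0x93 byte rather than something confirmed. The sample bytes are made up for illustration.

```python
# Hypothetical sample: 0x93 and 0x94 are the curly double quotes in Windows-1252.
raw = b"\x93quoted\x94"

# Decoding as UTF-8 fails, because 0x93 is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding as cp1252 succeeds and yields the curly quotes.
decoded = raw.decode("cp1252")

print(utf8_ok)
print(decoded)
```

If the real file decodes cleanly as cp1252, that would suggest the quotes come from a Windows export and not from corrupted data.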
So I save the CSV from Excel as UTF-8 so as not to upset any parsers. I can do this in either of two ways:
and/or
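An alternative to going through Excel would be to re-encode the file in Python. The following is a rough sketch under the assumption that the source file is cp1252; `reencode_to_utf8` and its parameters are hypothetical names, not part of Prodigy.

```python
def reencode_to_utf8(src, dst, src_encoding="cp1252"):
    """Read src using src_encoding and write the same text to dst as UTF-8 without a BOM."""
    # src_encoding is an assumption; change it if the file uses another code page.
    # newline="" leaves the original line endings untouched on both read and write.
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)
    return dst
```

Something like `reencode_to_utf8("annotations.csv", "annotations_utf8.csv")` would then produce a copy that should decode as UTF-8, leaving the original file as it was.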
However, after I do this, I see the following when I run this command:
python -m prodigy ner.manual test_txt7 blank:en ./annotations.csv --label LABEL1,LABEL2
Any ideas how to properly clean these and make Prodigy happy?
Thanks so much.
edit: For good measure, I confirmed that the length of the stream is zero:
from prodigy.components.loaders import CSV
stream = list(CSV("annotations.csv"))
print(len(stream))
Result: 0
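One thing that may be worth checking is the header row. This is an assumption, not something confirmed here: Excel's "CSV UTF-8" export typically writes a byte-order mark (BOM), so the first column name may come out as "\ufefftext" instead of "text". As far as I understand, the CSV loader looks for a column named "text" or "Text". A minimal sketch to compare the header under both encodings, where `read_header` is a hypothetical helper:

```python
import csv

def read_header(path, encoding):
    """Return the CSV header row as parsed by csv.DictReader with the given encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        # fieldnames reads the first row when accessed, so access it while the file is open
        return csv.DictReader(f).fieldnames
```

Comparing `read_header("annotations.csv", "utf-8")` with `read_header("annotations.csv", "utf-8-sig")` would show whether a BOM is attached to the first column name, since the "utf-8-sig" codec strips it.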