Strange UTF-8 Issue

Hello,

The way I receive my data is in a CSV file.
[screenshot of the CSV data]

Sometimes this data contains byte 0x93 (a Windows-1252 "smart quote", which isn't valid UTF-8), which throws the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\site-packages\prodigy\__main__.py", line 53, in <module>       
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 331, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src\prodigy\core.pyx", line 353, in prodigy.core._components_to_ctrl
  File "cython_src\prodigy\core.pyx", line 142, in prodigy.core.Controller.__init__
  File "cython_src\prodigy\components\feeds.pyx", line 56, in prodigy.components.feeds.SharedFeed.__init__
  File "cython_src\prodigy\components\feeds.pyx", line 155, in prodigy.components.feeds.SharedFeed.validate_stream    
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\site-packages\toolz\itertoolz.py", line 376, in first
    return next(iter(seq))
  File "cython_src\prodigy\components\preprocess.pyx", line 128, in add_tokens
  File "cython_src\prodigy\components\filters.pyx", line 37, in filter_duplicates
  File "cython_src\prodigy\components\filters.pyx", line 13, in filter_empty
  File "cython_src\prodigy\components\loaders.pyx", line 180, in CSV
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\csv.py", line 110, in __next__
    self.fieldnames
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\csv.py", line 97, in fieldnames
    self._fieldnames = next(self.reader)
  File "C:\ProgramData\Anaconda3\envs\prodigy_test\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
**UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 170: invalid start byte**
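
As far as I can tell, byte 0x93 is the Windows-1252 encoding of a left curly quote, so the same bytes that fail to decode as UTF-8 decode fine as cp1252. A minimal sketch that reproduces the error:

data = b"He said \x93hello\x94"   # curly quotes as Windows-1252 bytes
print(data.decode("cp1252"))       # He said “hello”
data.decode("utf-8")               # raises UnicodeDecodeError: invalid start byte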

So I save the CSV from Excel as UTF-8, so as not to upset any parsers. I can do this in either of two ways:

[screenshots of the two Excel options for saving as UTF-8]

However, after I do this, the stream still comes up empty when I run the following command:

python -m prodigy ner.manual test_txt7 blank:en ./annotations.csv --label LABEL1,LABEL2

Any ideas how to properly clean these and make Prodigy happy?

Thanks so much.

edit: For good measure, I confirmed that the length of the stream is zero:

from prodigy.components.loaders import CSV

stream = list(CSV("annotations.csv"))
print(len(stream))  # prints 0

Hi! The way you fixed the UTF-8 encoding looks good, and it should have solved the original decoding error. Maybe there's some problem with how Excel creates your CSV file? How does the file look when you open it in a text editor?

Under the hood, the CSV loader calls into Python's csv.DictReader, so you could try the following to check if your CSV is interpreted correctly:

from pathlib import Path
import csv

f = Path("annotations.csv").open("r", encoding="utf8")
reader = csv.DictReader(f)
for row in reader:
    print(row)

If the above works and row["Text"] correctly returns your text, then Prodigy should be able to load it.
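
It can also help to print reader.fieldnames, which shows the exact header keys the DictReader sees, including any invisible characters:

print(reader.fieldnames)  # the parsed header row, exactly as Python sees it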

annotations.csv looks like this when I run it through the DictReader:
[screenshot: the DictReader rows show the first key as "\ufeffText"]

In Notepad it looks like this:

[screenshot of the file in Notepad]

annotations.csv is a CSV UTF-8 (Comma Delimited) file. I did a bunch of research to determine if the "\ufeff" is messing things up, but I've been unsuccessful in trying to remove it from my CSV. One option is to change the encoding to utf-8-sig, but I've also had no luck implementing that solution (see my attempt below), and to be honest, I don't know if it's actually causing the problem or not. I assume that Prodigy reads the "\ufeffText" header and can't find the "Text" column it expects. Is this what's causing the stream to be empty?
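
For what it's worth, a UTF-8 BOM shows up as the three bytes EF BB BF at the very start of the file, so it can be checked in binary mode:

with open("annotations.csv", "rb") as f:
    print(f.read(3) == b"\xef\xbb\xbf")   # True if the file starts with a UTF-8 BOM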

I just can't seem to clean the CSV to get it to work with Prodigy. I've tried utf-8-sig (which clears out the "\ufeff" when I change the encoding in your code):

with open('annotations.csv', 'r', encoding='utf-8-sig') as fin:
    with open('annotations2.csv', 'w', encoding='utf-8-sig') as fout:
        # writing with utf-8-sig adds the BOM right back to the new file
        fout.write(fin.read())
PS C:\ProgramData\Anaconda3\envs\prodigy_test\code> python -m prodigy ner.manual test_txt7 blank:en ./annotations2.csv --label LABEL1,LABEL2
Using 2 label(s): LABEL1, LABEL2

✘ Error while validating stream: no first example
This likely means that your stream is empty.

Re-running your code for annotations2.csv (using utf8) puts me back to where I started:
[screenshot: the "\ufeff" shows up in the DictReader output again]

Yes, that seems to be what's going on. I hadn't seen this \ufeff marker before, but after some googling, it looks like it's the byte order mark (BOM) that Windows tools often prepend when saving UTF-8 files. It's definitely something you want to fix, because it can easily cause headaches with CSVs in other processes later on: it'll always confuse Python.

Maybe you can just save the plain text file as UTF-8 directly, instead of going via Excel? It seems like that adds some complexity and can make it difficult to understand what's going on under the hood.

Another option would be to have a simple Python script that opens your CSV file with the utf-8-sig encoding and then uses csv.writer or pandas to save out the CSV as utf-8. This should work, but it's probably still good to figure out how to save CSV files with the correct encoding directly, because otherwise, you'll always have to use this separate step whenever you want to load your CSVs with Python.
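
For example, a rough sketch of that conversion step with pandas (the file names here are just placeholders):

import pandas as pd

# reading with utf-8-sig strips the BOM from the first header,
# and writing as plain utf-8 never adds one back
df = pd.read_csv("annotations.csv", encoding="utf-8-sig")
df.to_csv("annotations_clean.csv", index=False, encoding="utf-8")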

Forgot to update with the final solution. For those who get their data in CSV format, you can simply save the file as UTF-8 in Excel and run the following code to clear out all of the "\ufeff":

with open('test2.csv', mode='r', encoding='utf-8-sig') as fin:
    with open('test3.csv', mode='w', encoding='utf8') as fout:
        # utf-8-sig strips the BOM on read; plain utf8 writes no BOM back
        fout.write(fin.read())

Once you do that, you'll be able to use your CSV file without getting all those errors.

@ines — a question for you. Do people in industry typically use JSON files? Or is CSV also a popular choice? Are there any downsides to using CSV in the context of text-based training using Prodigy?


Thanks for the update, glad to hear you found a solution :blush:

CSV is definitely very common, especially for plain text or text + labels. But it also has some limitations: you can't easily represent more complex, nested data structures, like lists of objects or nested dictionaries. In NLP, it's common to have annotations like spans (character or token offsets with labels) or dependencies and relations (token pairs with labels). Those are hard to fit into a flat table structure: they quickly become unreadable or require additional parsing logic.

JSON is a lot more flexible here because it supports nested structures out-of-the-box. It also has a clear notation for strings, integers and boolean values.

JSONL is still a bit more niche, but we've found it very useful for NLP data, because it solves one of the problems with JSON (as opposed to CSV or plain text): it contains one JSON object per line, so it can be read in line by line, instead of requiring the whole file to be read into memory first. That's a lot more efficient, and also why we chose it for Prodigy.
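
As a sketch, reading JSONL is really just parsing one line at a time (assuming each line holds one task object with a "text" field):

import json

def read_jsonl(path):
    # yield one parsed JSON object per line, without loading the whole file into memory
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

for task in read_jsonl("annotations.jsonl"):
    print(task["text"])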

In Prodigy specifically, there's not really a downside to using CSV to read in your texts, vs. some other format. However, if you want to load in pre-annotated data with tokens or spans, you can't really express this in a CSV file. So in that case, you want to use a JSON or JSONL file instead.
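
For example, a single pre-annotated task with one span could look like this as a JSONL line (the text, offsets and label here are just made up):

{"text": "Apple is looking at buying a U.K. startup", "spans": [{"start": 0, "end": 5, "label": "ORG"}]}

That nested "spans" list is exactly the kind of structure that has no natural place in a flat CSV row.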
