Error running recipe with CSV file

Hi there,

When running python3 -m prodigy ner.correct poc_en_parties en_core_web_sm ./input.csv --label ORG,PERSON --unsegmented, I get the following output:

Using 2 label(s): ORG, PERSON
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/forge/.local/lib/python3.7/site-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 335, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 362, in prodigy.core._components_to_ctrl
  File "cython_src/prodigy/core.pyx", line 123, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 168, in prodigy.components.feeds.Feed.__init__
  File "cython_src/prodigy/components/stream.pyx", line 110, in prodigy.components.stream.Stream.__init__
  File "cython_src/prodigy/components/stream.pyx", line 116, in prodigy.components.stream.Stream._start_count
  File "cython_src/prodigy/components/stream.pyx", line 135, in prodigy.components.stream.Stream._get_buffer
  File "/home/forge/.local/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 244, in make_tasks
    for doc, eg in nlp.pipe(texts, as_tuples=True, batch_size=10):
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1488, in pipe
    for doc, context in zip(docs, contexts):
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1521, in pipe
    for doc in docs:
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1488, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 227, in pipe
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1443, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1488, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/pipe.pyx", line 53, in pipe
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1488, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/pipe.pyx", line 53, in pipe
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1488, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/transition_parser.pyx", line 227, in pipe
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1443, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1488, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 73, in pipe
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1443, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1488, in _pipe
    yield from proc.pipe(docs, **kwargs)
  File "spacy/pipeline/trainable_pipe.pyx", line 73, in pipe
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/util.py", line 1443, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1518, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1479, in <genexpr>
    texts = (tc[0] for tc in text_context1)
  File "/home/forge/.local/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 243, in <genexpr>
    texts = ((eg["text"], eg) for eg in stream)
  File "cython_src/prodigy/components/preprocess.pyx", line 164, in add_tokens
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1488, in pipe
    for doc, context in zip(docs, contexts):
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1521, in pipe
    for doc in docs:
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1518, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/home/forge/.local/lib/python3.7/site-packages/spacy/language.py", line 1479, in <genexpr>
    texts = (tc[0] for tc in text_context1)
  File "cython_src/prodigy/components/preprocess.pyx", line 157, in genexpr
  File "cython_src/prodigy/components/loaders.pyx", line 29, in _add_attrs
  File "cython_src/prodigy/components/filters.pyx", line 46, in filter_duplicates
  File "cython_src/prodigy/components/filters.pyx", line 18, in filter_empty
  File "cython_src/prodigy/components/loaders.pyx", line 23, in _rehash_stream
  File "cython_src/prodigy/components/loaders.pyx", line 195, in CSV
AttributeError: 'NoneType' object has no attribute 'lower'

The exact same command with the exact same input used to work before I upgraded to 1.11.0. Is there something I'm doing wrong, or did something break in the upgrade?

What does your CSV file, and especially its header, look like? From the error, it seems like you somehow ended up with a column whose header is None. Not sure how this happens in Python's csv module, but it's what causes the error here (and it's something Prodigy previously ignored). If you can find a way to fix this in your CSV header, that would be the easiest workaround. We'll also add a workaround for this in the next release to make Prodigy ignore non-string headers.

Unfortunately, I can't share the file since it contains confidential information (except for the header). The header is "Text", and the file has one column, with one text per line.

The file contents don't really matter, it seems to just be the headers. Is there maybe an empty second column? Basically, what seems to happen is that some column comes back as None (but I have no idea why).

You can also test it like this:

from pathlib import Path
import csv

file_path = Path("./input.csv")
# Read only the first row to see how the header is parsed
with file_path.open("r", encoding="utf8", newline="") as f:
    reader = csv.DictReader(f)
    first_entry = next(reader)
print(dict(first_entry))

This is the output of your script:
{'text': 'text that is replaced'}

where I replaced the actual text with "text that is replaced". The format seems OK, right?

Thanks for checking – this is super mysterious :thinking: I just don't understand where the None would be coming from. If you go through all your examples, is there any row that comes back with a None key?
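To check every row rather than just the first one, something like the following could help. It's only a sketch: the helper name rows_with_none_key is made up for illustration, and the path ./input.csv is assumed to match the file from the command above.

```python
import csv
from pathlib import Path


def rows_with_none_key(lines):
    """Return (line_number, row) for every row where DictReader produced a None key."""
    reader = csv.DictReader(lines)
    found = []
    for row in reader:
        if None in row:
            # reader.line_num is the number of source lines read so far
            found.append((reader.line_num, dict(row)))
    return found


file_path = Path("./input.csv")  # assumed location of the input file
if file_path.exists():
    with file_path.open("r", encoding="utf8", newline="") as f:
        for line_num, row in rows_with_none_key(f):
            print(line_num, row)
```

Any line number it prints should point at a row worth inspecting by hand.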

Basically, all the CSV loader is doing (and which previously failed) is:

for row in reader:
    row = {key.lower(): value for key, value in row.items()}

I've (hopefully) fixed this by adding an instance check for the key here, but I still want to understand what led to this and why a key would come back as None.
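As a rough illustration of what such an instance check could look like (this is a sketch, not Prodigy's actual source, and the function name normalize_row is hypothetical):

```python
def normalize_row(row):
    """Lowercase string keys and drop any key that is not a string (e.g. None)."""
    return {
        key.lower(): value
        for key, value in row.items()
        if isinstance(key, str)  # skip None or other non-string keys
    }


# Example row with a None key, like the one that triggered the AttributeError
print(normalize_row({"Text": "some text", None: ["stray value"]}))
```

With this kind of filter, a stray None key is silently dropped instead of raising an error, at the cost of discarding whatever values were stored under it.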

No, it returns the correct lines, until a similar error occurs:

Traceback (most recent call last):
  File "temp.py", line 10, in <module>
    row = {key.lower(): value for key, value in row.items()}
  File "temp.py", line 10, in <dictcomp>
    row = {key.lower(): value for key, value in row.items()}
AttributeError: 'NoneType' object has no attribute 'lower'

I also tried a very simple CSV consisting of only:

Text
"This is line 1"
"This is line 2"

That one actually seems to work. So there is something specific about my CSV that causes problems. But it still surprises me that the same CSV used to work before the upgrade to 1.11.0. Before, I was on prodigy-1.10.8. By the way, I'm running python3.7.

My best guess is that you might have an empty line in there or something like that? (Although, I would have expected that to come back as an empty string.) Definitely let me know if you've figured out which line causes the problem, I'm curious!

We did introduce a small fix that was supposed to make the CSV reader more robust, by lowercasing the column headings so they're not case-sensitive (and you could, in theory, have TEXT). What we didn't anticipate was that Python's csv can, in some circumstances, produce None keys.

Edit: Just released v1.11.1, which should fix this by ignoring any non-string keys :slightly_smiling_face:

With your script, I found that it got stuck on a line where the text isn't quoted, perhaps because some punctuation was interpreted as a column separator. That would be my only explanation for now.
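That explanation would be consistent with how csv.DictReader is documented to behave: when a row has more fields than the header, the extra values are collected in a list under the restkey, which defaults to None. A minimal reproduction with made-up sample data (not the real file) might look like this:

```python
import csv
import io

# One quoted line and one unquoted line, both containing a comma
data = 'Text\n"Quoted, so the comma is kept"\nUnquoted, so the comma splits the row\n'

rows = [dict(row) for row in csv.DictReader(io.StringIO(data))]
for row in rows:
    print(row)
# {'Text': 'Quoted, so the comma is kept'}
# {'Text': 'Unquoted', None: [' so the comma splits the row']}
```

The second row ends up with a None key, which is exactly the kind of key that key.lower() fails on.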

Yes, it works perfectly now! Thanks!