Handling very large fields in CSV


(Hannah Lindsley) #1

I’m trying to play around with a CSV file that has some very large fields. I’m getting
_csv.Error: field larger than field limit (131072)
which isn’t all that unexpected. Is there any way to mitigate this through the API?

(Ines Montani) #2

Yeah, Prodigy uses the built-in csv module, so this is definitely possible. Are those fields relevant to your annotation tasks, or are they just random other data that you’re not even planning to use?

I had a look and found this slightly hacky solution – but it looks like this could potentially lead to other errors down the line. So I’m not 100% sure we want to integrate something like this. (If you have ideas, let me know!)
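For reference, the workaround usually suggested for this error is to raise the csv module's limit with csv.field_size_limit. Here's a minimal sketch, assuming that's the kind of hack in question. raise_csv_field_limit is a hypothetical helper name, and the retry loop is there because very large values can raise an OverflowError on some platforms:

```python
import csv
import sys


def raise_csv_field_limit(limit=sys.maxsize):
    """Raise the csv module's per-field size limit and return the value that was set."""
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value did not fit into a C long on this platform, so try a smaller one
            limit = limit // 10


if __name__ == "__main__":
    print(raise_csv_field_limit())
```

Keep in mind that the limit is a safeguard against malformed files, so lifting it means a broken quote can cause a huge chunk of the file to be read as one field.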

In the meantime, a nice solution could be to write your own simple pre-processing script that works around this issue (e.g. using the hack mentioned above) and writes the individual annotation tasks to stdout. Prodigy’s built-in recipes that support loading data from a source argument will default to stdin if no source is set. So you could do something like:

import json

# placeholder: swap in your own loader that lifts the field limit or cleans the data
data = load_your_csv_without_limit_or_clean_it()
for row in data:
    # extract your fields and reformat them as annotation tasks
    task = {'text': row.get('text'), 'label': row.get('label')}  # etc.
    print(json.dumps(task))   # print the dumped JSON, one example per line

You can then simply pipe the output forward to the recipe you’re using:

python preprocess_data.py | prodigy ner.teach dataset en_core_web_sm
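Put together, preprocess_data.py could look roughly like the sketch below. It assumes your CSV has columns named text and label (as in the snippet above) and takes the path to the file as a command-line argument, which is an addition for illustration. rows_to_tasks and main are hypothetical names, and the limit of 2**31 - 1 is just a large value that should fit on common platforms:

```python
import csv
import json
import sys


def rows_to_tasks(csv_file, text_field="text", label_field="label"):
    """Yield one annotation task dict per row of an open CSV file."""
    for row in csv.DictReader(csv_file):
        # extract your fields and reformat them as annotation tasks
        yield {"text": row.get(text_field), "label": row.get(label_field)}


def main(path):
    # Assumption: 2**31 - 1 is large enough for your fields
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf8") as f:
        for task in rows_to_tasks(f):
            # print the dumped JSON, one example per line
            print(json.dumps(task))


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
```

With that version, you'd pass the file to the script before piping, e.g. python preprocess_data.py your_data.csv, followed by the same pipe into the recipe.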

Of course, you could also do this in a custom recipe if you prefer.

(Hannah Lindsley) #3

Ah, ok great! Yes, they’re important to my annotation task. Essentially, a bunch of documents associated with a person get concatenated into one view of the person, so there’s a lot of data in one record. I’m trying to annotate that documentation.

I’ll make my own recipe for it. Thanks for the help, this is the first time I’ve really sat down and played around with things.