Handling very large fields in csv

hannahlindsley · January 10, 2018, 7:40pm

I’m trying to play around with a CSV file that has some very large fields. I’m getting
_csv.Error: field larger than field limit (131072)
which isn’t all that unexpected. Is there any way to mitigate this through the api?

ines · January 10, 2018, 7:56pm

Yeah, Prodigy uses the built-in csv module, so this is definitely possible. Are those fields relevant to your annotation tasks, or are they just random other data that you’re not even planning to use?

I had a look and found this slightly hacky solution – but it looks like this could potentially lead to other errors down the line. So I’m not 100% sure we want to integrate something like this. (If you have ideas, let me know!)

In the meantime, a nice solution could be to write your own simple pre-processing script that works around this issue (e.g. using the hack mentioned above) and writes the individual annotation tasks to stdout. Prodigy’s built-in recipes that support loading in data from a source argument will default to stdin if no source is set. So you could do something like:

import json

# placeholder: load the CSV with a raised field limit, or clean it first
data = load_your_csv_without_limit_or_clean_it()
for row in data:
    # extract your fields and reformat them as annotation tasks
    task = {'text': row.get('text'), 'label': row.get('label')}  # etc.
    print(json.dumps(task))   # print the dumped JSON, one example per line

You can then simply pipe the output forward to the recipe you’re using:

python preprocess_data.py | prodigy ner.teach dataset en_core_web_sm

Of course, you could also do this in a custom recipe if you prefer.
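
For illustration, here is a minimal sketch of what such a preprocess_data.py could look like. It assumes the hack in question is raising the limit with csv.field_size_limit from the standard library, and that the CSV has "text" and "label" columns as in the snippet above. The function names raise_field_limit and rows_to_tasks are made up for this example and are not part of Prodigy.

```python
import csv
import json
import sys


def raise_field_limit():
    """Raise the csv field size limit as high as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms reject very large values, so back off and retry.
            limit = limit // 10


def rows_to_tasks(csv_file):
    """Yield one annotation task per CSV row (assumed columns: text, label)."""
    for row in csv.DictReader(csv_file):
        yield {'text': row.get('text'), 'label': row.get('label')}


def main(path):
    raise_field_limit()
    with open(path, newline='', encoding='utf8') as f:
        for task in rows_to_tasks(f):
            # One JSON object per line, ready to be piped to a recipe
            print(json.dumps(task))


if __name__ == '__main__' and len(sys.argv) > 1:
    main(sys.argv[1])
```

With this version the script takes the CSV path as its first argument, so the pipe becomes something like python preprocess_data.py your_file.csv | prodigy ner.teach dataset en_core_web_sm. Note that the raised limit applies to the whole Python process, which may be one reason to keep it in a separate script rather than changing it globally.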

hannahlindsley · January 10, 2018, 8:00pm

Ah, ok great! Yes, they’re important to my annotation task. Essentially, a bunch of documents associated with a person get concatenated into one view of the person, so there’s a lot of data in one record. I’m trying to annotate that documentation.

I’ll make my own recipe for it. Thanks for the help, this is the first time I’ve really sat down and played around with things.
