JSON linting and dataset validation checks missing until you load a data point (bug report)

I was tagging, and mid-session the app failed to load a record, stating there was something wrong with the data. I had saved my annotations along the way, but I now have to restart the task after fixing the data error. It would have been much better if the error had been surfaced upfront, before time was spent tagging. A simple json.loads on each line as a validation step by Prodigy might have prevented this. It sounds like something the tool should do when the task loads, not midway through the experiment.
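Something along these lines would already catch the problem before any tagging starts (just a rough sketch; the file name is an example):

```python
import json

def validate_jsonl(path):
    # Parse every line of a JSONL file upfront and report the first line
    # that fails, instead of failing mid-session once the record is reached.
    with open(path, encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"Invalid JSON on line {line_no}: {err}")

validate_jsonl("examples.jsonl")
```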

Thanks for reporting! This is an interesting one.

I can imagine you wouldn't want to lint the entire file every time, because JSONL files can be extremely long. A possible alternative would be a linter command in Prodigy that checks whether the JSONL file has the required data for a given task, or a config option that lets Prodigy skip bad JSON lines via something like try/except. There are pros and cons to consider here, but I'll discuss this with the team!
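To sketch the skip idea (this is not an existing Prodigy option, just the rough shape of it):

```python
import json

def stream_jsonl_skip_invalid(path):
    # Yield parsed records from a JSONL file, skipping lines that don't
    # parse instead of aborting the whole stream. Hypothetical behavior.
    with open(path, encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON on line {line_no}")
```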

Will ping back once we've reached consensus.

It turns out that we do check the first batch of the .jsonl file. This appeared in my output just now on another task:

```
> python -m prodigy textcat.manual issue-5975 examples.jsonl --label pos,neg
Using 2 label(s): pos, neg
Added dataset issue-5975 to database SQLite.

✘ Failed to load task (invalid JSON on line 1)
This error pretty much always means that there's something wrong with this line
of JSON and Python can't load it. Even if you think it's correct, something must
confuse it. Try calling json.loads(line) on each line or use a JSON linter.
```

The consensus is that we'll explore a lint command. It could help check not only the datasets but also the prodigy.json config file. No promises on a due date, but it's a feature that's currently under consideration.
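Roughly, such a lint step could try to parse both files up front (again only a sketch, not the actual command):

```python
import json

def lint(jsonl_path, config_path):
    # Check that the config file is valid JSON as a whole ...
    with open(config_path, encoding="utf8") as f:
        json.load(f)
    # ... and that every non-empty line of the dataset parses.
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError with details

lint("examples.jsonl", "prodigy.json")
```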

Thanks again for the ping :slight_smile:!
