JSON linting and dataset validation checks missing until you load a data point (bug report)

I was tagging, and mid-session the app failed to load a record, stating there was something wrong with the data. I had saved my annotations along the way, but I now have to restart the task after fixing the data error. It would have been much better if the error had been surfaced upfront, before time was spent tagging. A simple json.loads on each line as a validation step by Prodigy might have prevented this. It sounds like something the tool should do when the task loads, not midway through the experiment.
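Something along these lines would already catch the problem before any tagging starts (just a rough sketch; the file name is an example):

```python
import json

def validate_jsonl(path):
    # Parse every line of a JSONL file upfront and report the first line
    # that fails, instead of failing mid-session once the record is reached.
    with open(path, encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"Invalid JSON on line {line_no}: {err}")

validate_jsonl("examples.jsonl")
```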

Thanks for reporting! This is an interesting one.

I can imagine you wouldn't want to lint the entire file every time, because JSONL files can be extremely long. A possible alternative would be a linter command in Prodigy that checks whether the JSONL file has the required data for a given task, or a config option that lets Prodigy skip bad JSON lines via something like try/except. There are pros and cons to consider here, but I'll discuss this with the team!
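To sketch the skip idea (this is not an existing Prodigy option, just the rough shape of it):

```python
import json

def stream_jsonl_skip_invalid(path):
    # Yield parsed records from a JSONL file, skipping lines that don't
    # parse instead of aborting the whole stream. Hypothetical behavior.
    with open(path, encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON on line {line_no}")
```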

Will ping back once we've reached consensus.

It turns out that we do check the first batch of the .jsonl file. This appeared in my output just now on another task:

```
> python -m prodigy textcat.manual issue-5975 examples.jsonl --label pos,neg
Using 2 label(s): pos, neg
Added dataset issue-5975 to database SQLite.

✘ Failed to load task (invalid JSON on line 1)
This error pretty much always means that there's something wrong with this line
of JSON and Python can't load it. Even if you think it's correct, something must
confuse it. Try calling json.loads(line) on each line or use a JSON linter.
```

The consensus is that we'll explore a lint command. It could help check not only the datasets but also the prodigy.json config file. No promises on a due date, but it's a feature that's currently under consideration.
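Roughly, such a lint step could try to parse both files up front (again only a sketch, not the actual command):

```python
import json

def lint(jsonl_path, config_path):
    # Check that the config file is valid JSON as a whole ...
    with open(config_path, encoding="utf8") as f:
        json.load(f)
    # ... and that every non-empty line of the dataset parses.
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError with details

lint("examples.jsonl", "prodigy.json")
```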

Thanks again for the ping :slight_smile:!
