Prodigy fails to start with empty stream

Hello,

We have a nightly process that loads new tasks to be processed in Prodigy. After this is done, we restart Prodigy so that the new tasks get loaded and displayed to the annotators. Sometimes this nightly process does not load any new tasks, and when Prodigy restarts, it fails with the error "Error while validating stream: no first batch. This likely means that your stream is empty." The problem is that we’re running Prodigy in Docker, so the container keeps restarting and the same exception is shown every time. I was wondering if it’s possible to run Prodigy even if there are no tasks, and have the web interface show "No tasks available".

Thanks

Ah, that’s an interesting edge case. We specifically added the validation in one of the recent versions to make Prodigy fail more gracefully, because in 99% of the cases, an empty stream indicates a problem with the input data and is not what the user wants.

However, if you know that your formats are correct, you can set "validate": false in your config, which will disable all of those checks (stream validation, task validation, etc.).
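
If turning off validation entirely feels too broad, another option is to check for an empty stream yourself before restarting. The following is only a minimal sketch, assuming the tasks come from a Python generator. `peek_stream` and `load_tasks` are hypothetical names used for illustration and are not part of Prodigy:

```python
from itertools import chain


def peek_stream(stream):
    """Return (is_empty, stream) without losing the first task."""
    iterator = iter(stream)
    try:
        first = next(iterator)
    except StopIteration:
        # Nothing to annotate: hand back an empty iterator
        return True, iter(())
    # Put the first task back in front of the remaining ones
    return False, chain([first], iterator)


def load_tasks():
    # Hypothetical stand-in for the nightly loader; yields nothing here
    yield from []


is_empty, stream = peek_stream(load_tasks())
if is_empty:
    print("No new tasks, skipping the restart")
```

With a check like this, the nightly job could decide not to restart the container at all when nothing new was loaded, which keeps validation enabled for the normal case.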

Hi Ines,

Thanks for your response. That seems to work, but it seems a bit dangerous to disable validation. I was wondering if this is something that could be implemented in a future release? Also, is there a function that we can call to manually perform the validation?

Thanks

Yes! Internally, the validation is implemented via a JSON schema, so the most flexible way to perform your own is to use the prodigy.get_schema method, which takes a view_id (for example, 'ner') and returns the schema. You can find more details and documentation in your PRODIGY_README.html. There are libraries for all kinds of languages that implement JSON schema validation.

Alternatively, you can also use the validate component. It's currently internal, so it's not officially documented, but it’s pretty straightforward:

from prodigy.components.validate import Validator

# create the validator
validator = Validator('ner')

def validate_stream(stream):
    for task in stream:
        # validate a task object
        validator.check(task)

Great, thanks. One last question: what’s the return type of that validator.check function? True/False? Or does it raise an exception if the validation fails?

Thanks

Ah, sorry, forgot to include that. It doesn’t return anything: if the validation fails, it raises an exception with a formatted error message showing the exact path and fields that didn’t pass the check. If you only need a boolean value, you can use validator.is_valid(task) instead.

(Under the hood, we use the jsonschema.Draft4Validator btw and get the errors using the iter_errors method.)
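
To illustrate the jsonschema side of this, here is a rough sketch that uses Draft4Validator directly, with is_valid for a boolean result and iter_errors to collect the failing paths. The schema below is a hypothetical, simplified stand-in; in practice it would be whatever prodigy.get_schema returns for your view_id. `describe_errors` is also just an illustrative helper, not Prodigy's own formatting:

```python
from jsonschema import Draft4Validator

# Hypothetical, simplified schema for illustration only.
# The real one would come from prodigy.get_schema('ner').
schema = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "spans": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "start": {"type": "integer"},
                    "end": {"type": "integer"},
                    "label": {"type": "string"},
                },
                "required": ["start", "end"],
            },
        },
    },
    "required": ["text"],
}

validator = Draft4Validator(schema)


def describe_errors(task):
    """Return one 'path: message' string per validation error."""
    errors = []
    for error in validator.iter_errors(task):
        # error.path holds the keys/indices leading to the failing value
        path = "/".join(str(part) for part in error.path) or "<root>"
        errors.append(f"{path}: {error.message}")
    return errors


good = {"text": "Hello", "spans": [{"start": 0, "end": 5, "label": "GREETING"}]}
bad = {"text": "Hello", "spans": [{"start": "0", "end": 5}]}

print(validator.is_valid(good))
print(describe_errors(bad))
```

Collecting the errors into a list rather than raising on the first one makes it easier to log every problem in a task in one pass.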

Thanks
