Unable to use binary classification for images using JSONL loader


I'm a relatively new Prodigy user, and I can't get binary image classification working with the mark recipe and a JSONL loader.
My JSONL is in the format
{"image": "/path/to/image.png"}
With verbose logging enabled, I see a "404 Not Found" error.
Prodigy seems to try to load the file using HTTP instead of from disk. Is there a way to get it to use file paths instead of URLs? I don't face this issue when using the image.manual recipe.
Any help would be appreciated.


Hi! The problem here is that when you load in your data via JSON(L), Prodigy will just take exactly what's in the data and render it in the UI – in this case, images from local file paths. However, images and other files from local paths are typically blocked by modern browsers for security reasons – you can disable this, but it's not usually recommended.

The image recipes use the fetch_media pre-processor to convert images from local paths to base64-encoded strings, so you can send them directly over the API. This works fine for smaller files and means you can store the data with the annotations. If your files are larger, it's usually better to serve them from a local web server or URLs instead. See my comment here for more details.
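To illustrate the idea of sending an image as a base64-encoded string, here is a minimal sketch using only the Python standard library. It is not Prodigy's fetch_media implementation; the function name `image_to_data_uri` is hypothetical, and the data URI output format is an assumption about what a browser can render directly.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path):
    """Return the file at `path` as a base64-encoded data URI string.

    Hypothetical helper for illustration only, not part of Prodigy.
    """
    path = Path(path)
    # Guess the MIME type from the file extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None:
        mime_type = "application/octet-stream"
    # Read the raw bytes and encode them as ASCII-safe base64 text.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Because the whole file ends up inside the string, the size of each task grows with the size of the image, which is why this approach suits smaller files.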

So if you want to load your images from a JSONL file instead of a directory, the easiest solution would be to either start a local web server in the top-level directory and use localhost paths, or to put them in an S3 bucket or similar.
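As a rough sketch of the local web server option: assuming you serve the top-level image directory with something like `python -m http.server 8000`, the paths in the JSONL file could be rewritten to localhost URLs along these lines. The helper names `path_to_url` and `rewrite_task`, the `served_dir` parameter, and the example directory are all hypothetical, and the port is an assumption.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import quote


def path_to_url(image_path, served_dir, base_url="http://localhost:8000"):
    """Map a file path under `served_dir` to a URL on the local web server."""
    # Path of the image relative to the directory the server was started in.
    relative = PurePosixPath(image_path).relative_to(served_dir)
    # Percent-encode characters such as spaces; slashes are kept as-is.
    return f"{base_url}/{quote(relative.as_posix())}"


def rewrite_task(line, served_dir, base_url="http://localhost:8000"):
    """Rewrite the "image" value of one JSONL line to a localhost URL."""
    task = json.loads(line)
    task["image"] = path_to_url(task["image"], served_dir, base_url)
    return json.dumps(task)


# Example with a made-up directory layout:
print(rewrite_task('{"image": "/data/images/testFiles/Ex1.jpg"}', "/data/images"))
```

Note that the base URL includes the `http://` scheme, so the browser treats the value as a URL rather than a path.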

Hi Ines,

Thank you for your response. I tried to make the change you suggested. It seems straightforward, but it looks like I'm still missing something and I'm unable to spot the problem.

I changed the input file to look like this (I tried with and without the 'label' entries, but the output was the same):

{"image": "localhost:8000/testFiles/Ex1.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex2.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex3.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex4.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex5.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex6.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex7.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex8.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex9.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex10.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex11.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex12.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex13.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex14.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex15.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex16.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex17.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex18.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex19.jpg", "label": "LESS_TRAVELLED"}
{"image": "localhost:8000/testFiles/Ex20.jpg", "label": "LESS_TRAVELLED"}

I ran a Python web server on the directory and tested the paths on the web server individually. They work, but Prodigy still doesn't show the image during the task. This is the console output of Prodigy:

PRODIGY_LOGGING=verbose prodigy mark test test.jsonl --loader jsonl --label LESS_TRAVELLED --view-id classification
18:02:48: INIT: Setting all logging levels to 10
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
18:02:48: RECIPE: Calling recipe 'mark'
Using 1 label(s): LESS_TRAVELLED
18:02:48: RECIPE: Starting recipe mark
{'dataset': 'test', 'source': 'test.jsonl', 'view_id': 'classification', 'loader': 'jsonl', 'memorize': False, 'exclude': None, 'label': ['LESS_TRAVELLED']}

18:02:48: LOADER: Loading stream from jsonl
18:02:48: VALIDATE: Validating components returned by recipe
18:02:48: CONTROLLER: Initialising from recipe
{'before_db': None, 'config': {'force_stream_order': True, 'label': 'LESS_TRAVELLED', 'dataset': 'test', 'recipe_name': 'mark'}, 'dataset': 'test', 'db': True, 'exclude': None, 'get_session_id': None, 'on_exit': <function mark.<locals>.print_results at 0x7f5bc3ade550>, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f5bd6b06880>, 'self': <prodigy.core.Controller object at 0x7f5bc9790490>, 'stream': <generator object mark.<locals>.ask_questions at 0x7f5bc3a68120>, 'update': None, 'validate_answer': None, 'view_id': 'classification'}

18:02:48: VALIDATE: Creating validator for view ID 'classification'
18:02:48: VALIDATE: Validating Prodigy and recipe config
18:02:48: DB: Initializing database SQLite
18:02:48: DB: Connecting to database SQLite
18:02:48: DB: Creating dataset '2021-07-12_18-02-48'
{'created': datetime.datetime(2021, 7, 12, 16, 20, 59)}

18:02:48: DatasetFilter: Excluding examples based on task hashes
18:02:48: DatasetFilter: Getting hashes for excluded examples
18:02:48: DatasetFilter: Excluding 0 tasks from datasets: test
18:02:48: CONTROLLER: Initialising from recipe
{'batch_size': 10, 'dataset': 'test', 'db': <prodigy.components.db.Database object at 0x7f5bc3ae3670>, 'exclude': 'task', 'filters': [{'name': 'DatasetFilter', 'datasets': ['test']}], 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.RepeatingFeed object at 0x7f5bc3ae3b50>, 'stream': <generator object mark.<locals>.ask_questions at 0x7f5bc3a68120>, 'validator': <prodigy.components.validate.Validator object at 0x7f5bc97904f0>, 'view_id': 'classification'}

18:02:48: CONTROLLER: Validating the first batch for session: None
18:02:48: CORS: initialized with wildcard "*" CORS origins

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

INFO:     Started server process [35265]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8080 (Press CTRL+C to quit)
INFO: - "GET / HTTP/1.1" 200 OK
INFO: - "GET /bundle.js HTTP/1.1" 200 OK
18:02:56: GET: /project
{'force_stream_order': True, 'label': 'LESS_TRAVELLED', 'dataset': 'test', 'recipe_name': 'mark', 'hint_pending_answers': True, 'view_id': 'classification', 'batch_size': 10, 'version': '1.10.8'}

INFO: - "GET /project HTTP/1.1" 200 OK
18:02:57: POST: /get_session_questions
18:02:57: FEED: Finding next batch of questions in stream
18:02:57: RESPONSE: /get_session_questions (10 examples)
{'tasks': [{'image': 'localhost:8000/testFiles/Ex1.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': 1133687673, '_task_hash': -1968890019, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex2.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': -1278723004, '_task_hash': 249945380, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex3.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': 748939934, '_task_hash': 1886336830, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex4.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': -1139307327, '_task_hash': 1265296011, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex5.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': 1215961313, '_task_hash': -291652115, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex6.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': -1115068182, '_task_hash': 821268107, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex7.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': -732147153, '_task_hash': -1721772183, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex8.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': -1840728251, '_task_hash': -358265106, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex9.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': -1065341484, '_task_hash': 861402339, '_session_id': None, '_view_id': 'classification'}, {'image': 'localhost:8000/testFiles/Ex10.jpg', 'label': 'LESS_TRAVELLED', '_input_hash': -201566043, '_task_hash': 2119986407, '_session_id': None, '_view_id': 'classification'}], 'total': 0, 'progress': None, 'session_id': '2021-07-12_18-02-48'}

INFO: - "POST /get_session_questions HTTP/1.1" 200 OK

I can view the image if I access the URL from the logs (e.g. localhost:8000/testFiles/Ex1.jpg). However, the Prodigy output looks like this:

Can you help point out what I am missing still?

Thanks & Regards,

I think you need to explicitly make it http://localhost:8000 so the browser knows it's a URL! Could you try that and see if it works now?
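If you'd rather not edit the file by hand, a small script along these lines could add the missing scheme to each entry. This is only a sketch: the helper name `ensure_http_scheme` is made up, and it assumes every scheme-less value is meant to be an HTTP URL.

```python
import json


def ensure_http_scheme(url):
    """Prefix `url` with http:// unless it already has a supported scheme."""
    # Leave full URLs and base64 data URIs untouched.
    if url.startswith(("http://", "https://", "data:")):
        return url
    return "http://" + url


# One line taken from the input file above.
line = '{"image": "localhost:8000/testFiles/Ex1.jpg", "label": "LESS_TRAVELLED"}'
task = json.loads(line)
task["image"] = ensure_http_scheme(task["image"])
print(json.dumps(task))
```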

That fixed the issue. It seems incredibly silly now that I realize it. Thanks a lot for your help.
