jsonl loading question

Hi, new-ish to Prodigy, not very new to Python, not new at all to working with data and code. That aside, I’m having a spot of trouble getting Prodigy to load my JSONL file:

dsample.jsonl (19.7 KB)

using: prodigy ner.teach mc_apad_listening en_core_web_lg /Users/User/Desktop/myfreshcorpus/input/dsample.jsonl

error:
15:38:29 - Task queue depth is 1
15:38:29 - Task queue depth is 2
15:38:29 - Exception when serving /get_questions
Traceback (most recent call last):
File “cython_src/prodigy/components/loaders.pyx”, line 145, in prodigy.components.loaders.JSONL
ValueError: Expected object or value

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/anaconda2/lib/python3.7/site-packages/waitress/channel.py”, line 336, in service
task.service()
File “/anaconda2/lib/python3.7/site-packages/waitress/task.py”, line 175, in service
self.execute()
File “/anaconda2/lib/python3.7/site-packages/waitress/task.py”, line 452, in execute
app_iter = self.channel.server.application(env, start_response)
File “hug/api.py”, line 423, in hug.api.ModuleSingleton.call.api_auto_instantiate
File “/anaconda2/lib/python3.7/site-packages/falcon/api.py”, line 244, in call
responder(req, resp, **params)
File “hug/interface.py”, line 793, in hug.interface.HTTP.call
File “hug/interface.py”, line 766, in hug.interface.HTTP.call
File “hug/interface.py”, line 703, in hug.interface.HTTP.call_function
File “hug/interface.py”, line 100, in hug.interface.Interfaces.call
File “/anaconda2/lib/python3.7/site-packages/prodigy/app.py”, line 173, in get_questions
tasks = controller.get_questions()
File “cython_src/prodigy/core.pyx”, line 129, in prodigy.core.Controller.get_questions
File “cython_src/prodigy/components/feeds.pyx”, line 56, in prodigy.components.feeds.SharedFeed.get_questions
File “cython_src/prodigy/components/feeds.pyx”, line 61, in prodigy.components.feeds.SharedFeed.get_next_batch
File “cython_src/prodigy/components/feeds.pyx”, line 131, in prodigy.components.feeds.SessionFeed.get_session_stream
File “/anaconda2/lib/python3.7/site-packages/toolz/itertoolz.py”, line 368, in first
return next(iter(seq))
File “cython_src/prodigy/components/sorters.pyx”, line 151, in iter
File “cython_src/prodigy/components/sorters.pyx”, line 61, in genexpr
File “cython_src/prodigy/models/ner.pyx”, line 292, in call
File “cython_src/prodigy/models/ner.pyx”, line 259, in get_tasks
File “cytoolz/itertoolz.pyx”, line 1047, in cytoolz.itertoolz.partition_all.next
File “cython_src/prodigy/models/ner.pyx”, line 209, in predict_spans
File “cytoolz/itertoolz.pyx”, line 1047, in cytoolz.itertoolz.partition_all.next
File “cython_src/prodigy/components/preprocess.pyx”, line 35, in split_sentences
File “/anaconda2/lib/python3.7/site-packages/spacy/language.py”, line 548, in pipe
for doc, context in izip(docs, contexts):
File “/anaconda2/lib/python3.7/site-packages/spacy/language.py”, line 572, in pipe
for doc in docs:
File “nn_parser.pyx”, line 367, in pipe
File “cytoolz/itertoolz.pyx”, line 1047, in cytoolz.itertoolz.partition_all.next
File “nn_parser.pyx”, line 367, in pipe
File “cytoolz/itertoolz.pyx”, line 1047, in cytoolz.itertoolz.partition_all.next
File “pipeline.pyx”, line 431, in pipe
File “cytoolz/itertoolz.pyx”, line 1047, in cytoolz.itertoolz.partition_all.next
File “/anaconda2/lib/python3.7/site-packages/spacy/language.py”, line 746, in _pipe
for doc in docs:
File “/anaconda2/lib/python3.7/site-packages/spacy/language.py”, line 551, in
docs = (self.make_doc(text) for text in texts)
File “/anaconda2/lib/python3.7/site-packages/spacy/language.py”, line 544, in
texts = (tc[0] for tc in text_context1)
File “cython_src/prodigy/components/preprocess.pyx”, line 34, in genexpr
File “cython_src/prodigy/components/filters.pyx”, line 35, in filter_duplicates
File “cython_src/prodigy/components/filters.pyx”, line 16, in filter_empty
File “cython_src/prodigy/components/loaders.pyx”, line 22, in _rehash_stream
File “cython_src/prodigy/components/loaders.pyx”, line 152, in JSONL
ValueError: Failed to load task (invalid JSON).

[
… [

Nothing loads - which has me wondering what I’ve missed.

ADDENDUM: Tried adding labels (spaCy NER), same result - error loading Prodigy on localhost. The full JSONL file is 23 MB. Could the size be the issue?

Thoughts? Slings? Arrows? Outrageous fortunes?

Best,

B

Hi! If I read this correctly, the problem might be that your file is actually JSON and not JSONL (newline-delimited JSON)? If you load in JSONL, the data is expected to have one JSON object on each line. So internally, all the loader really does is read in the file line-by-line and call json.loads() on each line. This fails for your file, because the first line is [, the second line is { and so on.
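
To make that concrete, here is a rough sketch of what line-by-line loading amounts to. It is not Prodigy's actual loader code: `iter_jsonl_lines` is a hypothetical name and the sample strings are made up for illustration.

```python
import json

def iter_jsonl_lines(lines):
    """Parse each non-empty line as its own JSON document (illustrative only)."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except ValueError as err:
            # json.JSONDecodeError is a subclass of ValueError.
            raise ValueError(f"invalid JSON on line {line_no}: {line!r}") from err

# One object per line: every line parses on its own.
jsonl_text = '{"text": "first"}\n{"text": "second"}\n'
good = list(iter_jsonl_lines(jsonl_text.splitlines()))

# A pretty-printed JSON array: the first line is just "[",
# which is not a complete JSON value, so parsing stops there.
json_text = '[\n  {\n    "text": "first"\n  }\n]\n'
try:
    list(iter_jsonl_lines(json_text.splitlines()))
    failed = False
except ValueError as err:
    failed = True
    message = str(err)
```

The second input is perfectly valid JSON as a whole file. It only fails because each line is parsed in isolation.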

So if you just rename your file to .json, it should work as expected, because the file will then be parsed as a whole. For large files, this is going to be slow-ish, so you might still prefer to convert it to JSONL. You can write your own little script (it’s really as straightforward as writing json.dumps(obj) + '\n' to each line) or use Prodigy’s built-in helper:

prodigy.util.write_jsonl('/path/to/file.jsonl', list_of_dicts)
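
If you'd rather write the little script yourself, a standard-library-only version could look roughly like this. It's a sketch that assumes the source file holds a single top-level JSON array of objects; `json_array_to_jsonl` and the file names in the usage note are hypothetical.

```python
import json
from pathlib import Path

def json_array_to_jsonl(src_path, dest_path):
    """Read one JSON array of objects and write one object per line."""
    records = json.loads(Path(src_path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected the top-level JSON value to be a list")
    with Path(dest_path).open("w", encoding="utf-8") as out:
        for record in records:
            # Without an indent argument, json.dumps escapes newline
            # characters inside strings, so each record stays on one line.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

Usage would be something like `json_array_to_jsonl("dsample.json", "dsample.jsonl")`. Note that this reads the whole array into memory once, which should be fine for a file of a few dozen megabytes.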

Btw, another small thing I noticed when looking at your sample data: Prodigy expects the text to be available as the key "text" – otherwise, it won’t know which entry to load from your data. So you probably want to rename one of "Title", "Snippet" etc. to "text".

You can also add an entry "meta", a dictionary of key/value pairs that will be shown in the bottom right corner of the annotation card. For example, {"meta": {"link": "http://www.valleyscw.com/..."}}.
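
As a sketch of that reshaping step: the function below copies one of your fields into "text" and, optionally, moves a URL field into "meta". I'm assuming "Snippet" is the field you want to annotate and that the URL lives under a key called "Link". Both are guesses about your data, and `to_prodigy_task` is a hypothetical helper, not part of Prodigy.

```python
def to_prodigy_task(record, text_key="Snippet", link_key=None):
    """Return a copy of record with text_key renamed to "text".

    If link_key is given and present, its value is moved into
    {"meta": {"link": ...}}. The input dict is left untouched.
    """
    task = dict(record)  # shallow copy so the caller's data is not mutated
    task["text"] = task.pop(text_key)
    if link_key is not None and link_key in task:
        task["meta"] = {"link": task.pop(link_key)}
    return task

# Made-up example record; the key names are assumptions.
example = {"Title": "Some title", "Snippet": "Some text to annotate.", "Link": "http://example.com"}
task = to_prodigy_task(example, text_key="Snippet", link_key="Link")
```

You'd apply this to each record before writing the JSONL file.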

Hi! I’ll sit down and make the changes and see what happens!

Thanks!

Well look at that! Worked like a charm!

This, this is great!

B

Pardon my resurrecting this "dead horse," but I am unable to use ner.manual with my JSONL patterns file (value of the --patterns argument). Prodigy's error message:

✘ Failed to load task (invalid JSON on line 1)
This error pretty much always means that there's something wrong with this line
of JSON and Python can't load it. Even if you think it's correct, something must
confuse it. Try calling json.loads(line) on each line or use a JSON linter.

The Prodigy error message is consistent with your helpful comments above. The JSONL file is my own creation, each line a valid JSON object. So with a Python script (attached) I have attempted to replicate what I think Prodigy does internally with my JSONL patterns file: consume the JSONL line by line, then transform it to JSON.

Everything in this "round-trip" bench test appears to work. So I'm clearly still missing something and will appreciate your kind advice.

Installations: Python (3.8.6), spaCy (2.3.4) and Prodigy (1.10.5)

Hi! Based on the error message, it seems like this is occurring when Prodigy is trying to load your input data, not the patterns file. So it looks like your patterns are fine (and your code snippet is indeed what Prodigy is doing under the hood :slightly_smiling_face:) but something in the first line of your input file doesn't parse correctly as JSON. Can you double-check that and see if it passes your manual json.loads test?
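
For the manual test, something along these lines should do. It's a minimal sketch: `find_bad_lines` is a hypothetical helper, and skipping blank lines is my own choice here, not a statement about how Prodigy treats them.

```python
import json

def find_bad_lines(path):
    """Return (line_number, error_message) for every line json.loads rejects."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored in this sketch
            try:
                json.loads(line)
            except ValueError as err:
                problems.append((line_no, str(err)))
    return problems
```

Run it on the input file (the positional source argument), not on the patterns file. An empty list means every line parsed on its own. If line 1 shows up, the object probably spans several lines.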

Thank you Ines once again for kindly supporting me, even with basic (dumb!) errors.

The INPUT (source) file should also be a .jsonl (for sensible chunking), but I violated basic JSONL: the enclosing braces ("{" and "}") of each object were on separate lines, which clearly does NOT conform to JSONL.