ValueError: Trailing data

I am trying to use the ner recipe and have transformed my data into the expected Prodigy format, but I got the following error message when running ner.teach:

Task queue depth is 1
Exception when serving /get_session_questions
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/waitress/channel.py", line 336, in service
    task.service()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/waitress/task.py", line 175, in service
    self.execute()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/waitress/task.py", line 452, in execute
    app_iter = self.channel.server.application(env, start_response)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/hug/api.py", line 451, in api_auto_instantiate
    return module.__hug_wsgi__(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/hug/interface.py", line 789, in __call__
    raise exception
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/hug/interface.py", line 762, in __call__
    self.render_content(self.call_function(input_parameters), context, request, response, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/hug/interface.py", line 698, in call_function
    return self.interface(**parameters)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/hug/interface.py", line 100, in __call__
    return __hug_internal_self._function(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/prodigy/_api/hug_app.py", line 228, in get_session_questions
    tasks = controller.get_questions(session_id=session_id)
  File "cython_src/prodigy/core.pyx", line 130, in prodigy.core.Controller.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 58, in prodigy.components.feeds.SharedFeed.get_questions
  File "cython_src/prodigy/components/feeds.pyx", line 63, in prodigy.components.feeds.SharedFeed.get_next_batch
  File "cython_src/prodigy/components/feeds.pyx", line 140, in prodigy.components.feeds.SessionFeed.get_session_stream
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "cython_src/prodigy/components/sorters.pyx", line 151, in __iter__
  File "cython_src/prodigy/components/sorters.pyx", line 61, in genexpr
  File "cython_src/prodigy/models/ner.pyx", line 292, in __call__
  File "cython_src/prodigy/models/ner.pyx", line 261, in get_tasks
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/toolz/itertoolz.py", line 716, in partition_all
    prev = next(it)
  File "cython_src/prodigy/models/ner.pyx", line 211, in predict_spans
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/toolz/itertoolz.py", line 716, in partition_all
    prev = next(it)
  File "cython_src/prodigy/components/preprocess.pyx", line 39, in split_sentences
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/language.py", line 688, in pipe
    for doc, context in izip(docs, contexts):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/language.py", line 716, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 221, in pipe
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/util.py", line 457, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "nn_parser.pyx", line 221, in pipe
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/util.py", line 457, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "pipes.pyx", line 379, in pipe
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/util.py", line 457, in minibatch
    batch = list(itertools.islice(items, int(batch_size)))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/language.py", line 903, in _pipe
    for doc in docs:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/language.py", line 691, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/spacy/language.py", line 680, in <genexpr>
    texts = (tc[0] for tc in text_context1)
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in genexpr
  File "cython_src/prodigy/components/filters.pyx", line 35, in filter_duplicates
  File "cython_src/prodigy/components/filters.pyx", line 16, in filter_empty
  File "cython_src/prodigy/components/loaders.pyx", line 22, in _rehash_stream
  File "cython_src/prodigy/components/loaders.pyx", line 163, in JSON
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/srsly/_json_api.py", line 37, in json_loads
    return ujson.loads(data)
ValueError: Trailing data

Is this something wrong with the data I imported?

Hi! It looks like you might have ended up with one line that’s invalid JSON – for example, because it’s missing a line break between two entries (which I think is what the “trailing data” error usually refers to).
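For context, this is the kind of input that triggers the error: two JSON objects fused onto one line with nothing joining them. A small stdlib sketch (the standard library's json reports this condition as "Extra data"; ujson, which Prodigy uses under the hood, calls it "Trailing data"):

```python
import json

# Two records on one line – the parser finishes the first object
# and then finds unexpected extra content after it
bad_line = '{"text": "foo"}{"text": "bar"}'
try:
    json.loads(bad_line)
except ValueError as err:
    print("Failed to parse:", err)
```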

To find the problematic line, you could use a little script like this:

from pathlib import Path
import json

with Path("your_file.jsonl").open("r") as f:
    for line in f:
        try:
            json.loads(line)
        except ValueError:
            print("Problem:", line)

I saw that there is a problem on every line :sweat_smile:

My JSON file looks like this:

{
  "text": "xxxxxxxxx",
  "spans": [
    {
      "start": 8,
      "end": 15,
      "label": "xxx"
    },
    {
      "start": 178,
      "end": 197,
      "label": "xxx"
    },
    {
      "start": 224,
      "end": 251,
      "label": "xxx"
    },
    {
      "start": 114,
      "end": 152,
      "label": "xxx"
    },
    {
      "start": 292,
      "end": 298,
      "label": "xxx"
    },
    {
      "start": 329,
      "end": 368,
      "label": "xxx"
    },
    {
      "start": xxx,
      "end": xxx,
      "label": "xxx"
    }
  ]
}

My code transforms it into a regular JSON file – is there a way to transform it into the right format (i.e. one record per line, with no line breaks)?
This is my code:

with open("creatednew_jsonfile", "a") as output_file:
    json.dump(d, output_file, indent=2)

Ahh okay – that explains a lot! JSONL is newline-delimited JSON, so one record needs to be on one line. (The advantage is that you can read it in line by line, but it also produces very long lines).

It looks like you’ve created a regular JSON file? So if you name it .json or set --loader json on the command line, it should be read in as JSON and work as expected.

If you do want JSONL, you’d have to write '\n'.join([json.dumps(line) for line in your_data]) to your file. Or you can use the helper function we provide in our library srsly:

import srsly  # that's our little serialization library

your_data = [{"text": "foo"}, {"text": "bar"}]  # whatever your data is
srsly.write_jsonl("/path/to/file.jsonl", your_data)
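If you’ve already written everything out as one pretty-printed .json file, a stdlib-only conversion could look like the sketch below (the file names are just placeholders, and the sample input is created inline so the snippet runs on its own):

```python
import json

# Stand-in for your existing "regular" JSON file:
# one pretty-printed list of records
records_in = [{"text": "foo", "spans": []}, {"text": "bar", "spans": []}]
with open("your_file.json", "w", encoding="utf8") as f:
    json.dump(records_in, f, indent=2)

# Convert: read the list back ...
with open("your_file.json", "r", encoding="utf8") as f:
    records = json.load(f)

# ... and write one compact JSON object per line (JSONL)
with open("your_file.jsonl", "w", encoding="utf8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```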

Thanks a lot for your response. I have applied your approach, but is there a reason my output looks like this? The values are missing.

"text"
"spans"
"text"
"spans"
"text"
"spans"
"text"

It looks like you might be accidentally iterating over a dictionary here instead of a list of dictionaries? For example, [x for x in {"text": "foo", "spans": []}] produces ["text", "spans"].
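To make the difference concrete, here’s a tiny sketch (variable names made up for illustration) that dumps each item of a single dict vs. a list of dicts:

```python
import json

single = {"text": "foo", "spans": []}        # one dictionary
records = [{"text": "foo", "spans": []},     # list of dictionaries
           {"text": "bar", "spans": []}]

# Iterating over a dict yields its *keys*, so you end up writing
# '"text"' and '"spans"' instead of full records:
print([json.dumps(x) for x in single])    # ['"text"', '"spans"']

# Iterating over the list yields complete examples:
print([json.dumps(x) for x in records])
```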

Hi Ines,

Thanks a lot for your answers. I’m not quite clear on how my data’s format conflicts with what’s expected, as you mentioned above. On one of the support pages, you mentioned that the Prodigy format should be something like this:

{"text": "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .", "spans": [{"start": 48, "end": 54, "label": "GPE"}, {"start": 77, "end": 81, "label": "GPE"}, {"start": 111, "end": 118, "label": "GPE"}]}

which is exactly what my format is:

{"text":xxxx, "spans": [{"start":xxx,"end":xxx, "label":xxx},{...},{...}]}

And in my understanding, this is not a list of dictionaries but a single dictionary? Is there a suggested way to transform it?

That format looks correct, yes – each example in your data should look like this. So assuming you have more than one text, the resulting data should be a list of dictionaries where every example looks like the ones you posted above :slightly_smiling_face:

If your data is a list of examples that look like that, you should be able to save it to JSON or JSONL without any problems. For example:

your_data = [
    {"text": "...", "spans": [...]}, 
    {"text": "...", "spans": [...]}
]

import srsly
srsly.write_jsonl("/path/to/file.jsonl", your_data)