Loading Prodigy output back into Prodigy

Hi,

I'm running an annotation project with Prodigy, and we want to run a second pass over our annotations. To do this, I need to load a copy of our first round of annotations back into Prodigy. I read elsewhere on this forum that I should be able to run the command line recipe with the output from the previous session and get those annotations highlighted in the new session. However, when I try that, I get the following error:

Traceback (most recent call last):
  File "/home/USER/miniconda3/envs/prodigy/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/USER/miniconda3/envs/prodigy/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/USER/miniconda3/envs/prodigy/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 389, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 73, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 170, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 104, in prodigy.components.feeds.Feed.__init__
  File "cython_src/prodigy/components/feeds.pyx", line 150, in prodigy.components.feeds.Feed._init_stream
  File "cython_src/prodigy/components/stream.pyx", line 107, in prodigy.components.stream.Stream.__init__
  File "cython_src/prodigy/components/stream.pyx", line 58, in prodigy.components.stream.validate_stream
  File "cython_src/prodigy/components/preprocess.pyx", line 168, in add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 264, in prodigy.components.preprocess._add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 226, in prodigy.components.preprocess.sync_spans_to_tokens
TypeError: string indices must be integers

What might be causing the error? I'm using this script to run Prodigy:

PRODIGY_PORT=8082 prodigy ner.manual database en_core_web_sm data/for_prodigy/data.jsonl --label data/labels

Each entry in the jsonl file I load looks like this (with specific data anonymized here):

{'tokens': <STR LIST OF TOKENS>, 'tags': <STR LIST OF TAGS>, 'spans': [{'start': 13, 'end': 30, 'token_start': 3, 'token_end': 5, 'label': <STR LABEL>}, {'text': <STR TOKEN>, 'start': 31, 'end': 37, 'pattern': 1031155696, 'token_start': 6, 'token_end': 6, 'label': <STR LABEL>}, {'text': <STR TOKEN>, 'start': 38, 'end': 43, 'pattern': -847644489, 'token_start': 7, 'token_end': 7, 'label': <STR LABEL>}, {'start': 52, 'end': 60, 'token_start': 10, 'token_end': 10, 'label': <STR LABEL>}], 'text': <STR TEXT>}
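
For illustration, a dummy entry with the same structure might look like this (all values made up):

{'tokens': ['Jane', 'Doe', 'works', 'at', 'Acme'], 'tags': ['B-PER', 'I-PER', 'O', 'O', 'B-ORG'], 'spans': [{'start': 0, 'end': 8, 'token_start': 0, 'token_end': 1, 'label': 'PERSON'}], 'text': 'Jane Doe works at Acme'}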

Any help on this would be much appreciated and happy to post further information. :slight_smile:

Hi @maybemkl!

Thanks for your question and welcome to the Prodigy community :wave:

Are you aware of the dataset: syntax (see the docs for details), which lets you load existing datasets as your source?

The dataset: syntax lets you specify an existing dataset as the input source. Prodigy will then load the annotations from the dataset and stream them in again. Annotation interfaces respect pre-defined annotations and will pre-select them in the UI. This is useful if you want to re-annotate a dataset to correct it, or if you want to add new information with a different interface. The following command will stream in annotations from the dataset ner_data and save the resulting reannotated data in a new dataset ner_data_new:

Example

prodigy ner.manual ner_data_new blank:en dataset:ner_data --label PERSON,ORG

As the docs say, you can even review only a certain answer type (e.g., dataset:ner_data:ignore would load only the ignored annotations).
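
For example, to re-annotate only the answers you accepted in the first round, the command would look something like this (same recipe and labels as the example above):

prodigy ner.manual ner_data_new blank:en dataset:ner_data:accept --label PERSON,ORG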

Would this solve your problem?

Hi @ryanwesslen, thanks for the super rapid reply! I tried this, but I still get the same error. Importing the jsonl with db-in works, but after that I get the same error when I try to start the Prodigy session. I could of course use the old datasets, but the problem is that the first round of annotations had multiple annotators, whereas this second round is intended to consolidate the first round with just two annotators. So the best approach for us would be to export the data from the first round, merge it, split it in two, and import it again. Any ideas on what that error might be about? The traceback is identical to the one I posted above.
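
For reference, the consolidation workflow I have in mind looks roughly like this (dataset and file names are just placeholders):

prodigy db-out first_round_annotator1 > annotator1.jsonl
prodigy db-out first_round_annotator2 > annotator2.jsonl
# merge the exports and split them into two files outside Prodigy, then:
prodigy db-in second_round_a merged_part_a.jsonl
prodigy db-in second_round_b merged_part_b.jsonl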

Hm... that's weird. Typically, that error has been caused by a formatting issue or something that needed a restart.

I'm a bit surprised that you saw the same issue with dataset:.

I'd suggest running that process on (say) the first 10 records. Does it run? If so, it's likely not a systematic issue but one that only affects a few records. If it doesn't run, then perhaps try the last 10 records or randomly sample a few rows.
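
For example, something like this, assuming the file path from your original command (first10.jsonl and test_dataset are placeholder names):

head -n 10 data/for_prodigy/data.jsonl > data/for_prodigy/first10.jsonl
PRODIGY_PORT=8082 prodigy ner.manual test_dataset en_core_web_sm data/for_prodigy/first10.jsonl --label data/labels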

Sorry, this is a bit of trial and error, but the goal is to figure out whether this is an issue with all of the records or only some of them. I've done this in the past when I didn't realize there was a minor data issue with a few sparsely populated records. If you do find it's only some records, try to isolate an example and understand why.
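
One thing worth checking, though this is just a guess from the traceback: sync_spans_to_tokens indexing into a string suggests that somewhere a span or token is a plain string instead of a dictionary. Here's a minimal sketch to flag such records, assuming the file path from your command above:

import json

# Flag JSONL records whose "spans" or "tokens" entries aren't dictionaries --
# one way to end up with "string indices must be integers" inside add_tokens.
with open("data/for_prodigy/data.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)
        for key in ("spans", "tokens"):
            for item in record.get(key, []):
                if not isinstance(item, dict):
                    print(f"Line {i}: {key} has a non-dict entry: {item!r}")

If I remember the expected format correctly, add_tokens wants "tokens" to be a list of dictionaries with "text", "start", "end" and "id" keys, not a list of plain strings, so a string list in "tokens" would be worth ruling out.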

This is a bit of a challenge without a fully reproducible example (I completely understand you can't provide one due to data privacy). If possible, could you create a similar dummy, ready-to-go .jsonl file? That would help a ton because then I can reproduce it. Also, please provide the exact commands if possible.