Using an existing dataset as the input source for annotation

Can I load annotated data to do a correction?

Hi @PaulJay1990!

Thanks for your question and welcome to the Prodigy community :wave:

Yes!

The title of your post mentions "Existing dataset as the input", which suggests you already have a Prodigy dataset containing the annotations.

Example: Correct from a Prodigy dataset with annotations

As mentioned in the docs, you can use the dataset: prefix in your source:

The dataset: syntax lets you specify an existing dataset as the input source. Prodigy will then load the annotations from the dataset and stream them in again. Annotation interfaces respect pre-defined annotations and will pre-select them in the UI. This is useful if you want to re-annotate a dataset to correct it, or if you want to add new information with a different interface. The following command will stream in annotations from the dataset ner_data and save the resulting reannotated data in a new dataset ner_data_new:

Example: review all dataset annotations

prodigy ner.manual ner_data_new blank:en dataset:ner_data --label PERSON,ORG

Optionally, you can append another : plus an answer value if you only want to load examples with a specific answer, such as "accept" or "ignore".

Example: review only accepted

prodigy ner.manual ner_data_new blank:en dataset:ner_data:accept --label PERSON,ORG
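Under the hood, the dataset:name:accept syntax just filters the stream by each example's "answer" field. If you've exported a dataset to JSONL (for example with prodigy db-out), you can do the same filtering yourself. Here's a minimal, hypothetical sketch (the function name and file path are just examples, not part of Prodigy's API):

```python
import json

def load_examples(path, answer=None):
    """Stream examples from a JSONL export, optionally keeping
    only those whose "answer" field matches the given value."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            eg = json.loads(line)
            # Keep everything, or only e.g. "accept" / "ignore"
            if answer is None or eg.get("answer") == answer:
                examples.append(eg)
    return examples
```

So load_examples("ner_data.jsonl", answer="accept") would mirror what dataset:ner_data:accept does on the Prodigy side.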

Correct from a file with annotations

Also, I'm not sure if you're also asking how to annotate an existing file that already contains annotations. If the annotations are in the correct format (see Annotation interfaces for examples of the format for each annotation interface), then the annotations should show up automatically.

Example

Let's say you have labeled NER data:
annotated_news_headlines.jsonl (252.9 KB)

Here's an example:

{
  "text": "How Silicon Valley Pushed Coding Into American Classrooms",
  "meta": {
    "source": "The New York Times"
  },
  "_input_hash": 1842734674,
  "_task_hash": 636683182,
  "tokens": [
    {
      "text": "How",
      "start": 0,
      "end": 3,
      "id": 0
    },
    {
      "text": "Silicon",
      "start": 4,
      "end": 11,
      "id": 1
    },
    {
      "text": "Valley",
      "start": 12,
      "end": 18,
      "id": 2
    },
    {
      "text": "Pushed",
      "start": 19,
      "end": 25,
      "id": 3
    },
    {
      "text": "Coding",
      "start": 26,
      "end": 32,
      "id": 4
    },
    {
      "text": "Into",
      "start": 33,
      "end": 37,
      "id": 5
    },
    {
      "text": "American",
      "start": 38,
      "end": 46,
      "id": 6
    },
    {
      "text": "Classrooms",
      "start": 47,
      "end": 57,
      "id": 7
    }
  ],
  "_session_id": null,
  "_view_id": "ner_manual",
  "spans": [
    {
      "start": 4,
      "end": 18,
      "token_start": 1,
      "token_end": 2,
      "label": "LOCATION"
    }
  ],
  "answer": "accept"
}

You can then run this input data as you would unannotated data:

python -m prodigy ner.manual issue-6489 blank:en data/annotated_news_headlines.jsonl --label PERSON,ORG,LOCATION
Using 3 label(s): PERSON, ORG, LOCATION
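If pre-annotated data doesn't show up as expected, a common cause is character offsets in "spans" that don't line up with token boundaries. Here's a hypothetical sanity check (not part of Prodigy itself) you could run over each example before loading, assuming the format shown above:

```python
def check_span_alignment(example):
    """Return True if every span's character offsets match the
    start/end of the tokens it claims via token_start/token_end."""
    tokens = {tok["id"]: tok for tok in example.get("tokens", [])}
    for span in example.get("spans", []):
        first = tokens.get(span["token_start"])
        last = tokens.get(span["token_end"])
        if first is None or last is None:
            return False  # span points at a token id that doesn't exist
        if span["start"] != first["start"] or span["end"] != last["end"]:
            return False  # character offsets don't match token boundaries
    return True
```

For the example above, the "Silicon Valley" span (start 4, end 18) matches tokens 1 and 2 exactly, so the check passes.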

Hope this helps!

Thank you.