NER manual source data format

Hi All,

I have been tasked by my organization with finding mentions of field-specific entities. I know I have to train a custom model and that I can use Prodigy to annotate the spans for my dataset, but I am not sure what the format of the source data should be. I see this command on the webpage (Built-in Recipes · Prodigy · An annotation tool for AI, Machine Learning & NLP), but what is the format of the source dataset? Can a text file work, or does it have to be JSON? Are there specific keys that should exist in the JSON file so that the spans and entities from Prodigy can be incorporated into it?

hi @fariiaakh!

Thanks for your question and welcome to the Prodigy community :wave:

You can find more details on the source data format in the Input Data section of the documentation.

As the docs explain, the simplest data file format is a .txt file with one document per line.

# data.txt
This is a sentence.
This is another sentence.

However, .jsonl is the preferred data format. Each line is a JSON object, with the document's text stored under the "text" key.

# data.jsonl
{"text": "This is a sentence."}
{"text": "This is another sentence.", "meta": {"score": 0.1}}

The "meta" key is optional. Its values will appear in the user interface.
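If your texts currently live in a .txt file, a minimal sketch like the one below could convert them to JSONL. The helper name `txt_to_jsonl`, the file paths and the `"source"` meta field are hypothetical and only for illustration; the `"text"` and `"meta"` keys are the ones described above.

```python
import json


def txt_to_jsonl(txt_path, jsonl_path, source_name=None):
    """Write one {"text": ...} JSON object per non-empty input line.

    Hypothetical helper: returns the number of records written.
    """
    count = 0
    with open(txt_path, encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                continue  # skip blank lines, they are not documents
            record = {"text": text}
            if source_name is not None:
                # "source" is an assumed meta field, any keys would do
                record["meta"] = {"source": source_name}
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `txt_to_jsonl("data.txt", "data.jsonl")` would then give you a file you can pass to a recipe as the source.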

Yes, annotated data has keys for spans (see below). However, these are not needed for unannotated data, since there are no spans or entities until the data has been annotated.

Two other items from the documentation may help you.

First, be sure to see the Annotation Interfaces section, for example the entry for ner.manual, which provides details on each user interface and the data format to expect as output from the UI. Use this as a template if you have previous annotations that you want to use. For example, this is the output data format for ner.manual:

{
  "text": "First look at the new MacBook Pro",
  "spans": [
    {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
  ],
  "tokens": [
    {"text": "First", "start": 0, "end": 5, "id": 0},
    {"text": "look", "start": 6, "end": 10, "id": 1},
    {"text": "at", "start": 11, "end": 13, "id": 2},
    {"text": "the", "start": 14, "end": 17, "id": 3},
    {"text": "new", "start": 18, "end": 21, "id": 4},
    {"text": "MacBook", "start": 22, "end": 29, "id": 5},
    {"text": "Pro", "start": 30, "end": 33, "id": 6}
  ]
}
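If you are building this format yourself from previous annotations, it may be worth checking that the character offsets and token indices agree before loading the data. The sketch below is one way to do that; `check_spans` is a hypothetical helper, and it assumes "end" is exclusive and "token_end" is inclusive, as in the example above.

```python
def check_spans(example):
    """Return a list of problems found in an example's spans (empty if none)."""
    text = example["text"]
    tokens = example.get("tokens", [])
    problems = []
    for span in example.get("spans", []):
        start, end = span["start"], span["end"]
        # Character offsets must fall inside the text
        if not (0 <= start < end <= len(text)):
            problems.append(f"span {start}-{end} is outside the text")
            continue
        # If token indices are given, they must match the character offsets
        if tokens and "token_start" in span and "token_end" in span:
            first = tokens[span["token_start"]]
            last = tokens[span["token_end"]]
            if first["start"] != start or last["end"] != end:
                problems.append(f"span {start}-{end} does not align with tokens")
    return problems


example = {
    "text": "First look at the new MacBook Pro",
    "spans": [
        {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
    ],
    "tokens": [
        {"text": "First", "start": 0, "end": 5, "id": 0},
        {"text": "look", "start": 6, "end": 10, "id": 1},
        {"text": "at", "start": 11, "end": 13, "id": 2},
        {"text": "the", "start": 14, "end": 17, "id": 3},
        {"text": "new", "start": 18, "end": 21, "id": 4},
        {"text": "MacBook", "start": 22, "end": 29, "id": 5},
        {"text": "Pro", "start": 30, "end": 33, "id": 6},
    ],
}

print(example["text"][22:33])  # MacBook Pro
print(check_spans(example))    # []
```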

Second, after you have annotated data, you can pull examples from your annotated datasets using this code:

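from prodigy.components.db import connect

# Connect to the database configured for Prodigy
db = connect()
# Load all annotated examples saved under this dataset name
examples = db.get_dataset("my_dataset")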

You can then view an example of your data so you can get a better understanding of its structure.
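For a quick overview of what has been annotated, something like the following sketch could help. It works on a plain list of dicts shaped like the ner.manual output shown earlier, so the small `sample` list here just stands in for the examples loaded from your dataset. `count_span_labels` is a hypothetical helper, and the check on the "answer" key assumes each saved example records whether it was accepted.

```python
from collections import Counter


def count_span_labels(examples):
    """Count span labels across accepted examples."""
    counts = Counter()
    for eg in examples:
        # Assumption: rejected or ignored examples carry an "answer" key
        # other than "accept"; examples without the key are counted.
        if eg.get("answer", "accept") != "accept":
            continue
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Stand-in data, not real annotations
sample = [
    {"text": "First look at the new MacBook Pro",
     "spans": [{"start": 22, "end": 33, "label": "PRODUCT"}],
     "answer": "accept"},
    {"text": "This is a sentence.", "spans": [], "answer": "accept"},
    {"text": "This is another sentence.",
     "spans": [{"start": 0, "end": 4, "label": "PRODUCT"}],
     "answer": "reject"},
]

print(count_span_labels(sample))  # Counter({'PRODUCT': 1})
```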

Let me know if you have any further questions!