NER manual source data format

fariiaakh · September 19, 2022, 6:21pm

Hi All,

I have been tasked by my organization with finding mentions of field-specific entities. I know I have to train a custom model and that I can use Prodigy to annotate the spans in my dataset, but I am not sure what format the source data should be in. I see the command on this webpage: Built-in Recipes · Prodigy · An annotation tool for AI, Machine Learning & NLP, but what is the format of the source dataset? Can a text file work, or does it have to be JSON? Are there specific keys that should exist in the JSON file so that the spans and entities from Prodigy can be incorporated into the file?

ryanwesslen · September 21, 2022, 8:05am

Hi @fariiaakh!

Thanks for your question, and welcome to the Prodigy community!

You can find more details on the source data format on the Input Data section of the documentation.

As the docs explain, the simplest input format is a .txt file with one document per line.

# data.txt
This is a sentence.
This is another sentence.

However, .jsonl is the preferred data format. Each line is a JSON object that holds one document under the "text" key.

# data.jsonl
{"text": "This is a sentence."}
{"text": "This is another sentence.", "meta": {"score": 0.1}}

The "meta" key is optional. Its values are displayed in the annotation interface.
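
If you already have a .txt file and would rather work with .jsonl, the conversion is small. Here is a minimal sketch using only the Python standard library; the function name lines_to_jsonl is made up for illustration and is not part of Prodigy.

```python
import json


def lines_to_jsonl(lines):
    """Convert text lines into JSONL: one {"text": ...} object per non-empty line.

    Hypothetical helper for illustration, not a Prodigy function.
    """
    records = []
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines so no empty documents are created
            records.append(json.dumps({"text": text}, ensure_ascii=False))
    if not records:
        return ""
    return "\n".join(records) + "\n"


# Example with in-memory lines; in practice you could pass an open file object
sample = ["This is a sentence.", "", "This is another sentence."]
print(lines_to_jsonl(sample), end="")
```

You could write the returned string to data.jsonl and pass that file to the recipe as the source.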

Yes, annotated data has keys for spans (see below). However, these are not needed for unannotated data: you cannot have spans or entities before you have annotated anything.

Two other items from the documentation that may help you.

First, be sure to see the Annotation Interfaces section, which describes each interface, such as ner.manual, and the data format it outputs. Use this as a template if you have existing annotations that you want to use. For example, this is the output data format for ner.manual:

{
  "text": "First look at the new MacBook Pro",
  "spans": [
    {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
  ],
  "tokens": [
    {"text": "First", "start": 0, "end": 5, "id": 0},
    {"text": "look", "start": 6, "end": 10, "id": 1},
    {"text": "at", "start": 11, "end": 13, "id": 2},
    {"text": "the", "start": 14, "end": 17, "id": 3},
    {"text": "new", "start": 18, "end": 21, "id": 4},
    {"text": "MacBook", "start": 22, "end": 29, "id": 5},
    {"text": "Pro", "start": 30, "end": 33, "id": 6}
  ]
}
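
If your existing annotations only have character offsets, a rough sketch like the one below can fill in "tokens", "token_start" and "token_end" in the shape shown above. It assumes simple whitespace tokenization, which may not match the tokenizer Prodigy uses for your data, and the helper names whitespace_tokens and align_span are made up for illustration.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace and record character offsets (an assumed, simplified tokenizer)."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]


def align_span(span, tokens):
    """Add token_start/token_end to a character-offset span.

    Returns None when the span does not start and end on token boundaries.
    """
    starts = {tok["start"]: tok["id"] for tok in tokens}
    ends = {tok["end"]: tok["id"] for tok in tokens}
    if span["start"] not in starts or span["end"] not in ends:
        return None
    return {
        **span,
        "token_start": starts[span["start"]],
        "token_end": ends[span["end"]],  # inclusive index of the last token
    }


text = "First look at the new MacBook Pro"
tokens = whitespace_tokens(text)
span = align_span({"start": 22, "end": 33, "label": "PRODUCT"}, tokens)
example = {"text": text, "spans": [span], "tokens": tokens}
```

Spans that come back as None do not line up with the token boundaries and are worth checking by hand before you load them.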

Second, after you have annotated data, you can pull examples from your annotated datasets using this code:

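from prodigy.components.db import connect

# Connect to the Prodigy database using your configured settings
db = connect()
# Fetch the annotated examples saved under the dataset name
examples = db.get_dataset("my_dataset")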

You can then print one of the examples to get a better understanding of its structure.
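
As a quick way to inspect what you pulled, here is a small sketch that counts the span labels. It assumes examples is a list of dicts shaped like the ner.manual output above; the function name count_span_labels is made up for illustration, and the sample record stands in for real data.

```python
from collections import Counter


def count_span_labels(examples):
    """Count how often each label appears across all spans (hypothetical helper)."""
    counts = Counter()
    for eg in examples:
        # Examples without a "spans" key are treated as having no spans
        for span in eg.get("spans", []):
            counts[span["label"]] += 1
    return counts


# Stand-in data in the ner.manual output shape; replace with your own examples
sample_examples = [
    {
        "text": "First look at the new MacBook Pro",
        "spans": [{"start": 22, "end": 33, "label": "PRODUCT"}],
    },
    {"text": "This is a sentence."},
]
label_counts = count_span_labels(sample_examples)
print(label_counts.most_common())
```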

Let me know if you have any further questions!
