Need help creating my own JSONL file for training the model

Hello, I would like to know the structure and format of the JSONL file I need in order to train my model. I tried the formats on the website, but they are not working. I would like to train my model on the names of countries, states/provinces, cities, etc. Please let me know how I can achieve this.

Your PRODIGY_README.html (available for download with Prodigy) has an “Input formats” section that shows the expected format of the data files you can load in, and a section “Annotation task formats”, which shows the format of the labelled examples Prodigy stores in the database.

Data you load in (to annotate it with a recipe like ner.teach) should ideally be a JSONL file with one dictionary/object per line and a "text" key. For example:

{"text": "This is a text"}
{"text": "This is another text"}

What exactly are you trying to do? Which recipe are you running, and what errors did you see when you loaded your data?

I am trying to use the format {"I am from Sangli" ,{"entities": [[11, 17 , "GPE" ]]} with the ner.make-gold and ner.batch-train recipes, but I am getting this error:
ValueError: Failed to load task (invalid JSON).

{"I am from Sangli" ,{"entities": [[11, 17 , "GPE" … m from Sangli" ,{"entities": [[11, 17 , "GPE" ]]}

Yes, there are two problems here:

  1. It’s invalid JSON. You’re using curly braces around values separated by a comma, e.g. {"foo", "bar"}. A JSON object needs key-value pairs, e.g. {"foo": "bar"}.
  2. It’s not the format expected by Prodigy. Prodigy notes highlighted spans as a list of "spans". See the “Annotation task formats” section in your PRODIGY_README.html for details. For NER, an incoming task could look like this:
{
    "text": "Apple updates its analytics service with new metrics",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"}
    ]
}
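Character offsets are easy to get wrong when typed by hand. Here is a minimal sketch, in plain Python with no Prodigy APIs, that computes them with str.find and serialises each task with json.dumps so the quoting is always valid. It assumes 0-based start offsets and exclusive end offsets, which matches the "Apple" example above (0 to 5). The helper name make_task is hypothetical.

```python
import json

def make_task(text, phrase, label):
    """Build a task dict with one span for the first occurrence of phrase."""
    start = text.find(phrase)  # 0-based character offset, -1 if not found
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return {
        "text": text,
        "spans": [{"start": start, "end": start + len(phrase), "label": label}],
    }

tasks = [
    make_task("I am from Sangli", "Sangli", "GPE"),
    make_task("I am from Satara", "Satara", "GPE"),
]

# JSONL: one JSON object per line
jsonl = "\n".join(json.dumps(task) for task in tasks)
print(jsonl)
```

Under this convention, "Sangli" in "I am from Sangli" covers offsets 10 to 16. The printed lines can be redirected into a .jsonl file.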

Hello, thank you for the reply. We have now tried the format you suggested:
{"text": "I am from Sangli", "spans": [{"start": 11, "end": 17, "label": "GPE"}]}
{"text": "I am from Satara", "spans": [{"start": 11, "end": 17, "label": "GPE"}]}
Now we are getting an error with the recipe
prodigy ner.teach prodata en_en_pro_web_sm test.jsonl
namely:
File "cython_src/prodigy/components/preprocess.pyx", line 143, in prodigy.components.preprocess._add_tokens
KeyError: 11

Please suggest how we should proceed.

Thank you

Can you try setting --unsegmented and see if that solves it?

Hello,
I tried with --unsegmented. Now I am getting the following error:

in prodigy.models.ner.EntityRecognizer.call.get_tasks.sort_by_entity
KeyError: 'start'

Hmmm, I haven’t seen that error before. Can you double-check that all entries in "spans" define a "start", "end" and "label"?
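One way to do this check mechanically is sketched below, in plain Python with no Prodigy APIs. It parses each line, then reports spans that lack one of the three keys or whose offsets fall outside the text. The function name check_lines and the sample lines are hypothetical, and the bounds test assumes 0-based offsets with an exclusive end. It is only a sanity check and does not reproduce Prodigy's own validation.

```python
import json

REQUIRED_KEYS = ("start", "end", "label")

def check_lines(lines):
    """Return a list of (line_number, message) problems found in JSONL lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            task = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(task, dict) or "text" not in task:
            problems.append((number, 'missing "text" key'))
            continue
        text = task["text"]
        for span in task.get("spans", []):
            missing = [key for key in REQUIRED_KEYS if key not in span]
            if missing:
                problems.append((number, f"span missing keys: {missing}"))
            elif not 0 <= span["start"] < span["end"] <= len(text):
                problems.append((number, "span offsets outside the text"))
    return problems

sample = [
    '{"text": "I am from Sangli", "spans": [{"start": 11, "end": 17, "label": "GPE"}]}',
    '{"text": "I am from Satara", "spans": [{"start": 10, "label": "GPE"}]}',
    '{"I am from Sangli", {"entities": [[11, 17, "GPE"]]}}',
]
for number, message in check_lines(sample):
    print(f"line {number}: {message}")
```

In this sample, the first line is flagged because "I am from Sangli" is only 16 characters long, so an end offset of 17 cannot be valid.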

I also wonder if what you’re trying to do will even work using ner.make-gold – that recipe sets its own entity annotations, so it might actually just overwrite what you already have in the data. You probably want to use ner.manual instead if you want to feed in pre-labelled examples.

As you suggested, I checked that all entries in "spans" define "start", "end" and "label", but I am facing the same error, even when passing the JSONL file to ner.make-gold. My problem is that I have to train my model on a very large dataset, and annotating manually or correcting in ner.make-gold would take a lot of our time. If you can help us train the model with predefined labels, it will save a huge amount of effort, and the accuracy of the model will be high.

thanks

Oh okay – I mean, Prodigy is an annotation tool, so the point of it is always to… actually annotate data, even if it’s just double-checking. If you already have annotations and just want to train a model, you might be better off using spaCy directly. See the training docs and spacy train for details.
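For illustration, here is a sketch that converts records in the "text"/"spans" shape into (text, {"entities": [...]}) tuples, the shape attempted earlier in this thread. Whether spaCy accepts that shape directly depends on the spaCy version, so treat the target format as an assumption and check the training docs. The helper name to_spacy_example is hypothetical and no spaCy API is called.

```python
import json

def to_spacy_example(task):
    """Convert {"text", "spans"} into a (text, {"entities": [...]}) tuple."""
    entities = [
        (span["start"], span["end"], span["label"])
        for span in task.get("spans", [])
    ]
    return (task["text"], {"entities": entities})

line = (
    '{"text": "Apple updates its analytics service with new metrics", '
    '"spans": [{"start": 0, "end": 5, "label": "ORG"}]}'
)
example = to_spacy_example(json.loads(line))
print(example)
```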