Need help creating my own JSONL file for training the model

joshiag · January 29, 2019, 2:00pm

Hello, I would like to know the structure and format of the JSONL file from which I can train my model. I tried the formats on the website, but they are not working. I would like to train my model with all the names of countries, states/provinces, cities, etc. Please let me know how I can achieve this.

ines · January 29, 2019, 4:36pm

Your PRODIGY_README.html (available for download with Prodigy) has an “Input formats” section that shows the expected format of the data files you can load in, and a section “Annotation task formats”, which shows the format of the labelled examples Prodigy stores in the database.

Data you load in (to annotate it with a recipe like ner.teach) should ideally be a JSONL file with one dictionary/object per line and a "text" key. For example:

{"text": "This is a text"}
{"text": "This is another text"}
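For reference, a minimal sketch of producing a file in this shape from Python, using only the standard json module. The to_jsonl helper name is made up for this illustration:

```python
import json

def to_jsonl(texts):
    """Serialize each text as one JSON object per line (JSONL)."""
    return "\n".join(json.dumps({"text": t}, ensure_ascii=False) for t in texts)

jsonl = to_jsonl(["This is a text", "This is another text"])
print(jsonl)
```

Letting json.dumps do the quoting and escaping avoids hand-written JSON mistakes; the resulting string can then be written to a .jsonl file.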

What exactly are you trying to do? Which recipe are you running, and what errors did you see when you loaded your data?

joshiag · January 30, 2019, 6:52am

I am trying to use the format {"I am from Sangli" ,{"entities": [[11, 17 , "GPE" ]]} with the ner.make-gold and ner.batch-train recipes, but I am getting this error:
ValueError: Failed to load task (invalid JSON).

{"I am from Sangli" ,{"entities": [[11, 17 , "GPE" … m from Sangli" ,{"entities": [[11, 17 , "GPE" ]]}

ines · January 30, 2019, 12:05pm

Yes, there are 2 problems here:

1. It’s invalid JSON. You’re using curly braces around values separated by a comma, e.g. {"foo", "bar"}.
2. It’s not the format expected by Prodigy. Prodigy records highlighted spans as a list of "spans". See the “Annotation task formats” section in your PRODIGY_README.html for details. For NER, an incoming task could look like this:

{
    "text": "Apple updates its analytics service with new metrics",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"}
    ]
}
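A small sketch of how a task like this could be built programmatically, assuming "end" is exclusive (as in the "Apple" example, 0 to 5). The make_task helper is hypothetical, not part of Prodigy. Note that in a JSONL file the whole object has to sit on a single line, which json.dumps does by default:

```python
import json

def make_task(text, phrase, label):
    """Build a task dict with one span, locating `phrase` inside `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    end = start + len(phrase)  # exclusive end offset
    return {"text": text, "spans": [{"start": start, "end": end, "label": label}]}

task = make_task("Apple updates its analytics service with new metrics", "Apple", "ORG")
line = json.dumps(task)  # one object per line, no embedded newlines
print(line)
```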

joshiag · January 31, 2019, 5:58am

Hello, thank you for the reply. We have now tried the format you suggested:
{"text": "I am from Sangli", "spans": [ {"start": 11, "end":17, "label": "GPE"}]}
{"text": "I am from Satara", "spans": [ {"start": 11, "end":17, "label": "GPE"}]}
Now we are getting an error with the recipe
prodigy ner.teach prodata en_en_pro_web_sm test.jsonl
which fails with:
File "cython_src/prodigy/components/preprocess.pyx", line 143, in prodigy.components.preprocess._add_tokens
KeyError: 11

Please suggest how we should proceed.

Thank you

ines · January 31, 2019, 10:32am

Can you try setting --unsegmented and see if that solves it?
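Something else that may be worth ruling out for the KeyError: 11 (this is an assumption, not something confirmed in this thread): whether the character offsets in "spans" actually line up with the text. A quick check in plain Python, with a made-up helper name, shows that 11 to 17 does not cover "Sangli":

```python
def span_texts(task):
    """Return (start, end, covered_text) for each span in a task dict."""
    text = task["text"]
    return [(s["start"], s["end"], text[s["start"]:s["end"]])
            for s in task.get("spans", [])]

task = {"text": "I am from Sangli",
        "spans": [{"start": 11, "end": 17, "label": "GPE"}]}
covered = span_texts(task)
print(covered)  # [(11, 17, 'angli')]

# Offsets computed from the text itself
start = task["text"].find("Sangli")
end = start + len("Sangli")
print(start, end)  # 10 16
```

Offsets are zero-based, so "Sangli" starts at character 10 and ends (exclusively) at 16.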

joshiag · February 1, 2019, 6:48am

Hello,
I tried with --unsegmented. Now I am getting the following error:

in prodigy.models.ner.EntityRecognizer.__call__.get_tasks.sort_by_entity
KeyError: 'start'

ines · February 1, 2019, 9:48am

Hmmm, I haven’t seen that error before. Can you double-check that all entries in "spans" define a "start", "end" and "label"?

I also wonder if what you’re trying to do will even work using ner.make-gold – that recipe sets its own entity annotations, so it might actually just overwrite what you already have in the data. You probably want to use ner.manual instead if you want to feed in pre-labelled examples.
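For a large file, that double-check is easier to script than to do by eye. A rough sketch using only the standard library; find_problems and the sample lines are invented for illustration. It also flags lines that are not valid JSON at all, for example because typographic quotes were used instead of straight double quotes:

```python
import json

REQUIRED_KEYS = ("start", "end", "label")

def find_problems(jsonl_text):
    """Return (line_number, message) pairs for lines that fail basic checks."""
    problems = []
    for lineno, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            task = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((lineno, "invalid JSON: " + err.msg))
            continue
        if not isinstance(task, dict):
            problems.append((lineno, "expected a JSON object"))
            continue
        for span in task.get("spans", []):
            missing = [key for key in REQUIRED_KEYS if key not in span]
            if missing:
                problems.append((lineno, "span is missing " + ", ".join(missing)))
    return problems

sample = "\n".join([
    '{"text": "I am from Sangli", "spans": [{"start": 10, "end": 16, "label": "GPE"}]}',
    '{"text": "I am from Satara", "spans": [{"end": 16, "label": "GPE"}]}',
    '{“text”: “I am from Sangli”}',  # typographic quotes: not valid JSON
])
problems = find_problems(sample)
for lineno, message in problems:
    print(lineno, message)
```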

joshiag · February 2, 2019, 5:12am

As you suggested, I checked that all entries in "spans" have "start", "end" and "label", but I am still facing the same error, even with ner.make-gold when passing the JSONL file to the recipe. My problem is that I have to train my model on a very large dataset, and annotating manually or correcting in ner.make-gold would take a lot of our time. If you can help us train the model with predefined labels, it will save a huge amount of effort, and the accuracy of the model will be high.

thanks

ines · February 2, 2019, 3:38pm

Oh okay – I mean, Prodigy is an annotation tool, so the point of it is always to… actually annotate data, even if it’s just double-checking. If you already have annotations and just want to train a model, you might be better off using spaCy directly. See the training docs and spacy train for details.
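If the annotations already exist, reshaping them for a spaCy training loop is mostly a matter of changing the container. A rough sketch, assuming the (text, {"entities": [(start, end, label)]}) tuples used in spaCy v2's training examples; the function name is hypothetical, and newer spaCy versions expect a different format, so check the training docs for the version you have installed:

```python
import json

def to_spacy_v2_examples(jsonl_text):
    """Convert JSONL tasks with "spans" into (text, {"entities": [...]}) tuples."""
    examples = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        task = json.loads(line)
        entities = [(s["start"], s["end"], s["label"]) for s in task.get("spans", [])]
        examples.append((task["text"], {"entities": entities}))
    return examples

sample = "\n".join([
    '{"text": "I am from Sangli", "spans": [{"start": 10, "end": 16, "label": "GPE"}]}',
    '{"text": "I am from Satara", "spans": [{"start": 10, "end": 16, "label": "GPE"}]}',
])
train_data = to_spacy_v2_examples(sample)
print(train_data[0])
```

The sample lines use offsets 10 to 16, which is where the place names actually sit in these two sentences.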
