JSONL with annotation for NER multi-tag for newbies

bbkudk · February 13, 2022, 11:57pm

Hi.
I am currently using Doccano for NER multi-label annotation. I was impressed with Prodigy's functionality and considering buying it. I watched the NER video and understood the basic concept but did not follow many details because I'm new to Python & Prodigy.

Can you please help me with a very simple guide, with easy steps and without complicated professional terminology, on how to import my existing JSONL file (dataset plus my annotations) exported from Doccano? Example of my JSONL:

{"id" : 1613, "data": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555", "label": [[0, 6, "COMPANYNAME"], [18, 22, "DATE"]]}

A very simplified step-by-step guide as for a child, not overloaded with lots of text, information and terminology would be much appreciated. It's easier to learn something new when you can succeed with simple tasks. Manuals are often not self-contained and assume knowledge of programming languages, technology, and terminology.

I have found this conversation, but it didn't help.
Cant load pre-annotated ner jsonl - Prodigy Support

Hope for your help,
Best, D

ljvmiranda921 · February 14, 2022, 2:46am

Hi @bbkudk , welcome to Prodigy!

Similar to Doccano, Prodigy also allows JSONL as input. From your example, it seems that following the ner format will be the simplest. So assuming you have this:

{"id": 1613, "data": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555", "label": [[0, 6, "COMPANYNAME"], [18, 22, "DATE"]]}

With the following fields:

data: DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555
label: the annotated spans, here a COMPANYNAME entity that starts at character 0 and ends at 6, and a DATE entity that starts at character 18 and ends at 22.

(I'm not quite sure what 1613 means. For now I'll assume it's an index and treat it as irrelevant.)
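To see what those start and end numbers select, here is a minimal sketch in plain Python. It assumes the offsets are character positions with an exclusive end, which is how Python slicing works; the helper name covered_text is made up for this illustration and is not part of Doccano or Prodigy.

```python
# The text and labels are copied from the example line in the question.
text = "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555"
labels = [[0, 6, "COMPANYNAME"], [18, 22, "DATE"]]


def covered_text(text, labels):
    """Return (label, substring) pairs, treating each end offset as exclusive."""
    return [(label, text[start:end]) for start, end, label in labels]


# Print each label next to the piece of text it covers.
for label, snippet in covered_text(text, labels):
    print(f"{label}: {snippet!r}")
```

For the example line, the first span covers 'DUNKIN'. The second covers '10/2', one character short of '10/24', so it may be worth checking the end offsets in the export before importing.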

What you need to do now is rearrange your data into its corresponding Prodigy task format:

{
"text": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555",
"spans": [
        {"start": 0, "end": 6, "label": "COMPANYNAME"}, 
        {"start": 18, "end": 22, "label": "DATE"}
    ]
}

It's the same information, just in a different format. You have to do this for each text in your dataset and save the results as JSONL. A JSONL file has one example per line. I only indented the example above for clarity, so you want to have something like this:

# Sample JSONL file - one sample per line
{"text": ....
{"text": ...
...

You can achieve this programmatically through Python by iterating through all your texts and formatting them inside a dictionary.
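A minimal sketch of that conversion, using only Python's standard library. It assumes every line of the Doccano export looks like the example in the question (a "data" string and a "label" list of [start, end, label] triples). The function names and the file names in the usage note are placeholders chosen for this illustration, not part of Prodigy or Doccano.

```python
import json


def doccano_to_prodigy(record):
    """Turn one Doccano record (a dict) into a Prodigy-style NER task (a dict)."""
    return {
        "text": record["data"],
        "spans": [
            {"start": start, "end": end, "label": label}
            for start, end, label in record.get("label", [])
        ],
    }


def convert_file(in_path, out_path):
    """Read a Doccano JSONL file and write one Prodigy task per line."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip empty lines
            task = doccano_to_prodigy(json.loads(line))
            dst.write(json.dumps(task, ensure_ascii=False) + "\n")


# Demonstration on the example record from the question.
example = {
    "id": 1613,
    "data": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555",
    "label": [[0, 6, "COMPANYNAME"], [18, 22, "DATE"]],
}
print(json.dumps(doccano_to_prodigy(example)))
```

To convert a whole file, you would call something like convert_file("doccano_export.jsonl", "source.jsonl"), where the first name is whatever your Doccano export is called. The "id" field is dropped here because it is assumed to be only an index.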

Once you have your JSONL file (we'll call it source.jsonl for now), you can use Prodigy to start annotating and correcting your labels. You can use either ner.manual ("I just want to label them by myself, I don't need any help from a model") or ner.correct ("I want to correct my annotations with a model actively helping me") for annotation. The simplest case may be to just use ner.manual:

prodigy ner.manual my_dataset blank:en ./source.jsonl --label COMPANYNAME,DATE

With the following parameters:

ner.manual: the recipe to use. There are a lot of recipes for NER, and you can choose any based on your task
my_dataset: the dataset name where your annotations will be saved
blank:en: the spaCy model for tokenization. You can use a trained pipeline or just a blank model.
./source.jsonl: the data from Doccano that you've converted into Prodigy's format
--label COMPANYNAME,DATE: the labels you'll use for annotation

If everything goes well, you can head to your browser (localhost:8080) and start annotating.
Hopefully it clears things up, feel free to follow up if you have any more questions!

ljvmiranda921 · February 14, 2022, 2:48am

Now, if you just want to import existing data without doing any annotation, you can use the db-in command:

prodigy db-in my_dataset ./source.jsonl

bbkudk · February 14, 2022, 1:22pm

Thanks for the detailed explanation! @ljvmiranda921
