JSONL with annotation for NER multi-tag for newbies

Hi.
I am currently using Doccano for NER multi-label annotation. I was impressed with Prodigy's functionality and am considering buying it. I watched the NER video and understood the basic concept, but I did not follow many of the details because I'm new to Python & Prodigy.

Could you please help me with a very simple guide, with easy steps and without professional or complicated terminology, on how to import my existing JSONL file (my dataset plus my annotations) exported from Doccano? Example of my JSONL:

{"id" : 1613, "data": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555", "label": [[0, 6, "COMPANYNAME"], [18, 22, "DATE"]]}

A very simplified step-by-step guide as for a child, not overloaded with lots of text, information and terminology would be much appreciated. It's easier to learn something new when you can succeed with simple tasks. Manuals are often not self-contained and assume knowledge of programming languages, technology, and terminology.

I have found this conversation, but it didn't help.
Cant load pre-annotated ner jsonl - Prodigy Support

Hope for your help,
Best, D

Hi @bbkudk , welcome to Prodigy!

Similar to Doccano, Prodigy also allows JSONL as input. From your example, it seems that following the ner format will be the simplest. So assuming you have this:

{"id" : 1613, "data": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555", "label": [[0, 6, "COMPANYNAME"], [18, 22, "DATE"]]}

With the following fields:

  • data: DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555
  • label: a COMPANYNAME entity that starts at character 0 and ends at 6, and a DATE entity that starts at character 18 and ends at 22. Prodigy calls these spans.

(I'm not quite sure what 1613 means. For now I'll assume it's an index and treat it as irrelevant.)

  1. What you need to do now is rearrange your data into its corresponding Prodigy task format:
{
"text": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555",
"spans": [
        {"start": 0, "end": 6, "label": "COMPANYNAME"}, 
        {"start": 18, "end": 22, "label": "DATE"}
    ]
}

It's the same information, but just in a different format. You have to do this for each text in your dataset, and you have to save them in JSONL. Usually, a JSONL file has one example per line. I only indented the text above for clarity, but you want to have something like this:

# Sample JSONL file - one sample per line
{"text": ....
{"text": ...
...

You can achieve this programmatically through Python by iterating through all your texts and formatting them inside a dictionary.
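
As a rough sketch of that loop: the script below assumes your Doccano export is called doccano_export.jsonl (a made-up filename, change it to yours) and that every line has the "data" and "label" fields from your example. The function names are only for illustration, they are not part of Prodigy.

```python
import json
from pathlib import Path


def doccano_to_prodigy(record):
    """Turn one Doccano record into a Prodigy-style task dictionary."""
    spans = [
        {"start": start, "end": end, "label": label}
        for start, end, label in record.get("label", [])
    ]
    # The "id" field is dropped here, on the assumption that it is not needed.
    return {"text": record["data"], "spans": spans}


def convert_file(input_path, output_path):
    """Read a Doccano JSONL file and write one Prodigy task per line."""
    with Path(input_path).open(encoding="utf-8") as src, \
            Path(output_path).open("w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            task = doccano_to_prodigy(json.loads(line))
            dst.write(json.dumps(task, ensure_ascii=False) + "\n")


# Hypothetical filenames: the conversion only runs if the input file exists.
if Path("doccano_export.jsonl").exists():
    convert_file("doccano_export.jsonl", "source.jsonl")
```

Save this as a .py file next to your export, run it with Python, and you should end up with a source.jsonl that has one {"text": ..., "spans": ...} entry per line.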

  2. Once you have your JSONL file (we'll call it source.jsonl for now), you can use Prodigy to start annotating and correcting your labels. You can use either ner.manual ("I just want to label them by myself, I don't need any help from a model") or ner.correct ("I want to correct my annotations with a model actively helping me") for annotation. The simplest case may be to just use ner.manual:
prodigy ner.manual my_dataset blank:en ./source.jsonl --label COMPANYNAME,DATE

With the following parameters:

  • ner.manual: the recipe to use. There are a lot of recipes for NER, and you can choose any based on your task
  • my_dataset: the dataset name where your annotations will be saved
  • blank:en: the spaCy model for tokenization. You can use a trained pipeline or just a blank model.
  • ./source.jsonl: the data source from Doccano that you've formatted into Prodigy
  • --label COMPANYNAME,DATE: the labels you'll use for annotation
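
If you have more labels than the two in the example, you may not want to type them all by hand. Here is a small sketch that collects every label found in Prodigy-style JSONL lines and joins them with commas, ready to paste after --label. The collect_labels function is a hypothetical helper written for this post, not something Prodigy provides.

```python
import json


def collect_labels(lines):
    """Return the sorted, unique span labels found in Prodigy-style JSONL lines."""
    labels = set()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        task = json.loads(line)
        for span in task.get("spans", []):
            labels.add(span["label"])
    return sorted(labels)


# One sample line in the Prodigy task format, shortened for the demo.
sample = [
    '{"text": "DUNKIN 351099 Q35 10/24", "spans": ['
    '{"start": 0, "end": 6, "label": "COMPANYNAME"}, '
    '{"start": 18, "end": 22, "label": "DATE"}]}',
]
print(",".join(collect_labels(sample)))  # prints COMPANYNAME,DATE
```

For a real file, you could pass the open file instead of the sample list, for example collect_labels(open("source.jsonl", encoding="utf-8")).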

If everything goes well, you can head to your browser (localhost:8080) and start annotating :)
Hopefully it clears things up, feel free to follow up if you have any more questions!


Now, if you just want to import existing data without doing any annotation, you can use the db-in command:

prodigy db-in my_dataset ./source.jsonl

Thanks for the detailed explanation! @ljvmiranda921