Hi @bbkudk , welcome to Prodigy!
Similar to Doccano, Prodigy also accepts JSONL as input. From your example, it seems that following the NER task format will be the simplest. So assuming you have this:
```json
{"id": 1613, "data": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555", "label": [[0, 6, "COMPANYNAME"], [18, 22, "DATE"]]}
```
With the following fields:
- `data`: the raw text, `DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555`
- `label`: the entities, i.e. a `COMPANYNAME` entity that starts at character 0 and ends at character 6, and a `DATE` entity that starts at character 18 and ends at character 22.
  (I'm not quite sure what `1613` means; for now I'll assume it's an index and treat it as irrelevant.)
- What you need to do now is rearrange your data into its corresponding Prodigy task format:
```json
{
  "text": "DUNKIN 351099 Q35 10/24 PURCHASE FEASTERVILLE PA DEBIT CARD *6555",
  "spans": [
    {"start": 0, "end": 6, "label": "COMPANYNAME"},
    {"start": 18, "end": 22, "label": "DATE"}
  ]
}
```
It's the same information, just in a different format. You have to do this for each text in your dataset, and you have to save the results as JSONL. A JSONL file has one example per line; I only indented the example above for clarity, but you want to end up with something like this:

```
# Sample JSONL file - one example per line
{"text": ....
{"text": ...
...
```
You can achieve this programmatically through Python by iterating through all your texts and formatting them inside a dictionary.
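For example, a minimal conversion sketch could look like the one below. The file names `doccano_export.jsonl` and `source.jsonl` are just placeholders, and I'm assuming your export uses the `data`/`label` fields shown above:

```python
import json

# Minimal sketch: convert a Doccano-style export into Prodigy's task format.
# File names are placeholders - point them at your actual files.
with open("doccano_export.jsonl", encoding="utf8") as f_in, \
     open("source.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        if not line.strip():
            continue  # skip empty lines
        example = json.loads(line)
        task = {
            "text": example["data"],
            "spans": [
                {"start": start, "end": end, "label": label}
                for start, end, label in example["label"]
            ],
        }
        f_out.write(json.dumps(task) + "\n")
```

Each line of `source.jsonl` will then be one Prodigy task like the example above.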
- Once you have your JSONL file (we'll call it `source.jsonl` for now), you can use Prodigy to start annotating and correcting your labels. For annotation you can use either `ner.manual` ("I just want to label everything myself, I don't need any help from a model") or `ner.correct` ("I want to correct my annotations with a model actively helping me"). The simplest case may be to just use `ner.manual`:

```
prodigy ner.manual my_dataset blank:en ./source.jsonl --label COMPANYNAME,DATE
```
With the following parameters:
- `ner.manual`: the recipe to use. There are a lot of recipes for NER, and you can choose one based on your task.
- `my_dataset`: the dataset name where your annotations will be saved.
- `blank:en`: the spaCy model used for tokenization. You can use a trained pipeline or just a blank model.
- `./source.jsonl`: the data source from Doccano that you've reformatted for Prodigy.
- `--label COMPANYNAME,DATE`: the labels you'll use for annotation.
If everything goes well, you can head to your browser (`localhost:8080`) and start annotating.
Hopefully this clears things up. Feel free to follow up if you have any more questions!