I just ran the spans.manual recipe on this single JSONL example:
{"text": "<i> ( d\u00e1le jen objednatel ) </i>"}
After labelling the name, I saved the annotation and quit Prodigy. Next, I ran prodigy db-out to have a look at the data format that Prodigy stores:
{
"text":"<i> ( d\u00e1le jen objednatel ) </i>",
"_input_hash":-1815072948,
"_task_hash":1525844819,
"tokens":[
{"text":"<","start":0,"end":1,"id":0,"ws":false},
{"text":"i","start":1,"end":2,"id":1,"ws":false},
{"text":">","start":2,"end":3,"id":2,"ws":true},
{"text":"(","start":4,"end":5,"id":3,"ws":true},
{"text":"d\u00e1le","start":6,"end":10,"id":4,"ws":true},
{"text":"jen","start":11,"end":14,"id":5,"ws":true},
{"text":"objednatel","start":15,"end":25,"id":6,"ws":true},
{"text":")","start":26,"end":27,"id":7,"ws":true},
{"text":"<","start":28,"end":29,"id":8,"ws":false},
{"text":"/i","start":29,"end":31,"id":9,"ws":false},
{"text":">","start":31,"end":32,"id":10,"ws":false}
],
"_view_id":"spans_manual",
"spans":[
{"start":6,"end":25,"token_start":4,"token_end":6,"label":"NAME"}
],
"answer":"accept",
"_timestamp":1650876701
}
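To make the comparison easier, here is a minimal sketch that reads a record like the one above and prints what each span actually covers. The helper name describe_spans is hypothetical, and the record is trimmed to the keys the check needs. The sketch assumes token_end is inclusive, which matches this record: tokens 4 to 6 cover characters 6 to 25.

```python
import json

# Record copied from the db-out output above, trimmed to the relevant keys.
RECORD = json.loads(r"""
{
  "text": "<i> ( d\u00e1le jen objednatel ) </i>",
  "tokens": [
    {"text": "<", "start": 0, "end": 1, "id": 0, "ws": false},
    {"text": "i", "start": 1, "end": 2, "id": 1, "ws": false},
    {"text": ">", "start": 2, "end": 3, "id": 2, "ws": true},
    {"text": "(", "start": 4, "end": 5, "id": 3, "ws": true},
    {"text": "d\u00e1le", "start": 6, "end": 10, "id": 4, "ws": true},
    {"text": "jen", "start": 11, "end": 14, "id": 5, "ws": true},
    {"text": "objednatel", "start": 15, "end": 25, "id": 6, "ws": true},
    {"text": ")", "start": 26, "end": 27, "id": 7, "ws": true},
    {"text": "<", "start": 28, "end": 29, "id": 8, "ws": false},
    {"text": "/i", "start": 29, "end": 31, "id": 9, "ws": false},
    {"text": ">", "start": 31, "end": 32, "id": 10, "ws": false}
  ],
  "spans": [
    {"start": 6, "end": 25, "token_start": 4, "token_end": 6, "label": "NAME"}
  ]
}
""")


def describe_spans(record):
    """Return (label, span text, token texts, size in tokens) per span.

    Hypothetical helper, not part of Prodigy. Assumes token_end is inclusive.
    """
    text = record["text"]
    tokens = record["tokens"]
    rows = []
    for span in record.get("spans", []):
        covered = tokens[span["token_start"]: span["token_end"] + 1]
        # The character offsets should line up with the token boundaries.
        assert covered[0]["start"] == span["start"]
        assert covered[-1]["end"] == span["end"]
        rows.append((
            span["label"],
            text[span["start"]:span["end"]],
            [token["text"] for token in covered],
            len(covered),
        ))
    return rows


for row in describe_spans(RECORD):
    print(row)
```

For this record the span is three tokens long, which agrees with the "sizes 3 to 3" that the training log further down reports for the suggester. Running the same check on your own db-out export should show quickly whether our annotations differ.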
Just to check, did I label the same thing as you? Could you share the tokens from your JSONL file as well? In general, I recommend stripping the HTML before labelling, because the spaCy tokeniser can have a hard time with it. In the output above, for example, the closing </i> is split into the three tokens <, /i and >.
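As a rough sketch of that preprocessing step, the snippet below strips tags with Python's standard-library html.parser before the text is written to JSONL. The names TextOnly and strip_html are hypothetical, and collapsing runs of whitespace is my own choice, not something Prodigy requires.

```python
import json
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text content of a document and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return text without HTML tags, with whitespace runs collapsed."""
    parser = TextOnly()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())


example = {"text": "<i> ( d\u00e1le jen objednatel ) </i>"}
cleaned = {"text": strip_html(example["text"])}

# Write one JSON object per line, as expected for a JSONL source file.
print(json.dumps(cleaned, ensure_ascii=False))
```

Stripping the tags shifts every character offset, so this should happen before annotation. Spans that were already labelled on the raw HTML text would no longer line up with the cleaned text.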
I can train on the dataset that contains this example, and the run seems to complete just fine:
> prodigy train --spancat spandemo
ℹ Using CPU
========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
Using 'spacy.ngram_range_suggester.v1' for 'spancat' with sizes 3 to 3 (inferred from data)
✔ Generated training config
=========================== Initializing pipeline ===========================
[2022-04-25 10:59:00,030] [INFO] Set up nlp object from config
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (1)
[2022-04-25 10:59:00,039] [INFO] Pipeline: ['spancat']
[2022-04-25 10:59:00,042] [INFO] Created vocabulary
[2022-04-25 10:59:00,042] [INFO] Finished initializing nlp object
[2022-04-25 10:59:00,064] [INFO] Initialized pipeline components: ['spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
Components: spancat
Merging training and evaluation data for 1 components
- [spancat] Training: 1 | Evaluation: 0 (20% split)
Training: 1 | Evaluation: 0
Labels: spancat (1)
ℹ Pipeline: ['spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  ------------  ----------  ----------  ----------  ------
  0       0          1.80        0.00        0.00        0.00    0.00
200     200          1.17        0.00        0.00        0.00    0.00
400     400          0.00        0.00        0.00        0.00    0.00