While exploring this further, I noticed something interesting about the way I've been calling rel.manual locally.
This line seems to cause the rendering error on my machine.
python -m prodigy rel.manual sm-ref-rel blank:it their-example.jsonl --label PARENT,SIBLING,SAME_AS --span-label J-REF,L-REF
But this one does not.
python -m prodigy rel.manual sm-ref-rel blank:it their-example.jsonl --label FOO --span-label BAR
The only difference is in the label names. This led me to dig a bit deeper into your JSON file.
Before, my assumption was that this issue was related to the tokenizer, but now I see that your file contains spans that correspond to the large token in your GIF. Note that these spans carry the J-REF or the L-REF label.
Here are the spans that I found:
"spans": [
{
"start": 1718,
"end": 1764,
"text": "Sez. U n. 12 del 31/5/2000, Jakani, Rv. 216260",
"source": "./models/les_model_ref_tmp_3/model-best",
"input_hash": 1945267493,
"token_start": 275,
"token_end": 287,
"label": "J-REF"
},
{
"start": 1766,
"end": 1819,
"token_start": 289,
"token_end": 301,
"label": "J-REF"
},
{
"start": 1906,
"end": 1914,
"text": "art. 616",
"source": "./models/les_model_ref_tmp_3/model-best",
"input_hash": 1945267493,
"token_start": 319,
"token_end": 320,
"label": "L-REF"
},
{
"start": 1915,
"end": 1921,
"text": "c.p.p.",
"source": "./models/les_model_ref_tmp_3/model-best",
"input_hash": 1945267493,
"token_start": 321,
"token_end": 322,
"label": "L-REF"
}
],
This would explain why you see some of the tokens "clump" together. It does not explain why you see a difference between macOS and Ubuntu. So that's certainly still worth investigating further.
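To check which span labels are present in the file, a small script along these lines might help. It is only a sketch that assumes the file is JSONL with an optional "spans" list per example, as in the snippet above; the function name count_span_labels is made up for illustration and is not part of Prodigy.

```python
import json
from collections import Counter


def count_span_labels(lines):
    """Count how often each span label occurs across JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines so a trailing newline does not break parsing.
            continue
        example = json.loads(line)
        # "spans" may be missing or null on examples without pre-annotations.
        for span in example.get("spans") or []:
            counts[span.get("label")] += 1
    return counts
```

You could call it with an open file handle, for example `count_span_labels(open("their-example.jsonl", encoding="utf8"))`, and compare the labels it reports against the ones passed to --span-label.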
A temporary solution for you might be to run the command without referring to the span labels currently in your JSON file. Something like:
python -m prodigy rel.manual sm-ref-rel blank:it their-example.jsonl --label PARENT,SIBLING,SAME_AS --span-label FOOBAR
I'll admit that this feels a bit like a "hack". Effectively, we'd be pretending that there's a span with the label FOOBAR so that Prodigy ignores all the other spans in the dictionary.
A better long-term solution might be to filter the spans out beforehand for this labelling task. This would, however, require some manual work via the db-out command, and I may be missing some context on the bigger picture to know for sure that it's the right approach.
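As a rough sketch of what that filtering could look like, assuming the data is a JSONL file with one example per line and an optional "spans" list on each example (whether exported with db-out or the original input file): the function name drop_spans and the output file name in the usage comment are hypothetical, not part of Prodigy.

```python
import json


def drop_spans(lines, labels=None):
    """Yield JSONL lines with spans removed.

    labels=None removes every span; otherwise only spans whose
    label is in `labels` are removed and the rest are kept.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        example = json.loads(line)
        spans = example.get("spans") or []
        if labels is None:
            kept = []
        else:
            kept = [s for s in spans if s.get("label") not in labels]
        if kept:
            example["spans"] = kept
        else:
            # Remove the key entirely when no spans are left.
            example.pop("spans", None)
        yield json.dumps(example, ensure_ascii=False)


# Example usage (file names are placeholders):
# with open("their-example.jsonl", encoding="utf8") as src, \
#         open("their-example-no-spans.jsonl", "w", encoding="utf8") as dst:
#     for out_line in drop_spans(src, labels={"J-REF", "L-REF"}):
#         dst.write(out_line + "\n")
```

The filtered file could then be passed to rel.manual in place of the original. I have not verified that this resolves the rendering issue, so treat it as something to try.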
Could you confirm whether the issue persists when you run the command with the --span-label FOOBAR setting? I'd love to confirm whether the spans in the JSON file are the cause of this behavior.