I saw that in the batch-train recipe you use the function prodigy.components.preprocess.split_sentences to split examples that consist of multiple sentences. I could not find any documentation for it, and I’m a bit confused by what it does.
When I execute the following snippet, where evals is from a dataset that was created with ner.teach:
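import copy
import prodigy.components.preprocess

# nlp is the loaded spaCy model; evals holds the examples from the ner.teach dataset
item = copy.deepcopy(evals[0])
print(item)
# Replace the text with two sentences to see how the example gets split
item['text'] = 'Asp - or ash? Or something completely else?'
out = list(prodigy.components.preprocess.split_sentences(nlp, [item]))
print(out)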
I get the following output (first the original item, then the result of the split) and I’m confused by it. If an example is split into two sentences, wouldn’t the start and end values of all spans also have to be adjusted? The second sentence now has spans attached that don’t make sense for it.
{
"text": "Asp – or ash?",
"spans": [
{
"start": 0,
"end": 3,
"text": "Asp",
"rank": 0,
"label": "DISASTER",
"score": 0.35136231090000003,
"source": "core_web_lg",
"input_hash": -1988687967,
"answer": "reject"
},
{
"text": "ash",
"start": 9,
"end": 12,
"label": "DISASTER",
"priority": 0.7142857313000001,
"score": 0.7142857313000001,
"pattern": 51,
"answer": "reject"
}
],
"meta": {
"source": "The Guardian",
"section": "Science",
"score": 0.35136231090000003
},
"_input_hash": -1988687967,
"_task_hash": 1779273069,
"answer": "reject"
}
[
{
"text": "Asp - or ash?",
"spans": [
{
"start": 0,
"end": 3,
"text": "Asp",
"rank": 0,
"label": "DISASTER",
"score": 0.35136231090000003,
"source": "core_web_lg",
"input_hash": -1988687967,
"answer": "reject"
},
{
"text": "ash",
"start": 9,
"end": 12,
"label": "DISASTER",
"priority": 0.7142857313000001,
"score": 0.7142857313000001,
"pattern": 51,
"answer": "reject"
}
],
"meta": {
"source": "The Guardian",
"section": "Science",
"score": 0.35136231090000003
},
"_input_hash": 714256756,
"_task_hash": -876779732,
"answer": "reject"
},
{
"text": "Or something completely else?",
"spans": [
{
"start": 0,
"end": 3,
"text": "Asp",
"rank": 0,
"label": "DISASTER",
"score": 0.35136231090000003,
"source": "core_web_lg",
"input_hash": -1988687967,
"answer": "reject"
},
{
"text": "ash",
"start": 9,
"end": 12,
"label": "DISASTER",
"priority": 0.7142857313000001,
"score": 0.7142857313000001,
"pattern": 51,
"answer": "reject"
}
],
"meta": {
"source": "The Guardian",
"section": "Science",
"score": 0.35136231090000003
},
"_input_hash": -1897717309,
"_task_hash": 2241477,
"answer": "reject"
}
]
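For comparison, here is a minimal sketch of the behaviour I would have expected: each span stays only with the sentence that contains it, and its start and end are shifted to be relative to that sentence. The helper split_with_spans is hypothetical and not part of Prodigy, and the sentence boundaries are hard-coded character offsets here; with spaCy they could presumably be taken from sent.start_char and sent.end_char of doc.sents.

```python
import copy


def split_with_spans(example, boundaries):
    """Hypothetical helper (not a Prodigy function): split an example at the
    given (start, end) character offsets, keeping only the spans that lie
    fully inside each sentence and making their offsets sentence-relative."""
    for sent_start, sent_end in boundaries:
        sent = copy.deepcopy(example)
        sent["text"] = example["text"][sent_start:sent_end]
        sent["spans"] = [
            {**span,
             "start": span["start"] - sent_start,
             "end": span["end"] - sent_start}
            for span in example.get("spans", [])
            if span["start"] >= sent_start and span["end"] <= sent_end
        ]
        yield sent


example = {
    "text": "Asp - or ash? Or something completely else?",
    "spans": [
        {"start": 0, "end": 3, "text": "Asp", "label": "DISASTER"},
        {"start": 9, "end": 12, "text": "ash", "label": "DISASTER"},
    ],
}

# Assumed sentence boundaries as character offsets into example["text"]
boundaries = [(0, 13), (14, 43)]

result = list(split_with_spans(example, boundaries))
for sent in result:
    print(sent["text"], [(s["start"], s["end"], s["text"]) for s in sent["spans"]])
```

With this, the first sentence would keep both spans at their original offsets, and the second sentence would end up with no spans at all, which is what I expected split_sentences to produce.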