prodigy.components.preprocess.split_sentences

I saw that the batch-train recipe uses the function prodigy.components.preprocess.split_sentences to split examples that consist of multiple sentences. I could not find any documentation for it, and I'm a bit confused about what it does.

When I execute the following snippet, where evals comes from a dataset that was created with ner.teach:

import copy
import prodigy.components.preprocess

item = copy.deepcopy(evals[0])
print(item)  # the original single-sentence example from the dataset
item['text'] = 'Asp - or ash? Or something completely else?'  # add a second sentence
out = list(prodigy.components.preprocess.split_sentences(nlp, [item]))
print(out)
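
For reference, nlp and evals were set up along these lines (a sketch – the dataset name is just a placeholder, and the model is assumed from the "source" in the meta):

import spacy
from prodigy.components.db import connect

nlp = spacy.load('en_core_web_lg')            # assumed model, per "core_web_lg" in the meta
evals = connect().get_dataset('my_eval_set')  # annotations collected with ner.teach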

I get the following output and I'm confused by it. If the text is split into two sentences, wouldn't the start and end values of all spans have to be adjusted as well? The second sentence now has spans attached that don't make sense for it – they still point at character offsets from the original text.

{
  "text": "Asp – or ash?",
  "spans": [
    {
      "start": 0,
      "end": 3,
      "text": "Asp",
      "rank": 0,
      "label": "DISASTER",
      "score": 0.35136231090000003,
      "source": "core_web_lg",
      "input_hash": -1988687967,
      "answer": "reject"
    },
    {
      "text": "ash",
      "start": 9,
      "end": 12,
      "label": "DISASTER",
      "priority": 0.7142857313000001,
      "score": 0.7142857313000001,
      "pattern": 51,
      "answer": "reject"
    }
  ],
  "meta": {
    "source": "The Guardian",
    "section": "Science",
    "score": 0.35136231090000003
  },
  "_input_hash": -1988687967,
  "_task_hash": 1779273069,
  "answer": "reject"
}

[
  {
    "text": "Asp - or ash?",
    "spans": [
      {
        "start": 0,
        "end": 3,
        "text": "Asp",
        "rank": 0,
        "label": "DISASTER",
        "score": 0.35136231090000003,
        "source": "core_web_lg",
        "input_hash": -1988687967,
        "answer": "reject"
      },
      {
        "text": "ash",
        "start": 9,
        "end": 12,
        "label": "DISASTER",
        "priority": 0.7142857313000001,
        "score": 0.7142857313000001,
        "pattern": 51,
        "answer": "reject"
      }
    ],
    "meta": {
      "source": "The Guardian",
      "section": "Science",
      "score": 0.35136231090000003
    },
    "_input_hash": 714256756,
    "_task_hash": -876779732,
    "answer": "reject"
  },
  {
    "text": "Or something completely else?",
    "spans": [
      {
        "start": 0,
        "end": 3,
        "text": "Asp",
        "rank": 0,
        "label": "DISASTER",
        "score": 0.35136231090000003,
        "source": "core_web_lg",
        "input_hash": -1988687967,
        "answer": "reject"
      },
      {
        "text": "ash",
        "start": 9,
        "end": 12,
        "label": "DISASTER",
        "priority": 0.7142857313000001,
        "score": 0.7142857313000001,
        "pattern": 51,
        "answer": "reject"
      }
    ],
    "meta": {
      "source": "The Guardian",
      "section": "Science",
      "score": 0.35136231090000003
    },
    "_input_hash": -1897717309,
    "_task_hash": 2241477,
    "answer": "reject"
  }
]

Thanks for the report – and I could have sworn split_sentences was in the docs, but turns out it’s not. Sorry, will fix this.

The function does pretty much exactly what you think it does – it uses spaCy to split the text into sentences, and then iterates over the spans and adjusts their start and end positions accordingly. So I'm pretty confused that it produces the result you shared. The preprocessor iterates over the sentences, deep-copies the example for each sentence, overwrites the text, collects all adjusted spans that fall within the sentence boundaries into a new list, and then overwrites the example's "spans" property with it.
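
To illustrate, here's a minimal sketch of that logic – not the actual Prodigy source, the function name and details are just illustrative:

import copy

def split_sentences_sketch(nlp, stream):
    for eg in stream:
        doc = nlp(eg['text'])
        for sent in doc.sents:
            # deep-copy the example so each sentence becomes an independent task
            new_eg = copy.deepcopy(eg)
            new_eg['text'] = sent.text
            # keep only the spans inside this sentence, shifted to sentence-relative offsets
            new_eg['spans'] = [
                dict(span,
                     start=span['start'] - sent.start_char,
                     end=span['end'] - sent.start_char)
                for span in eg.get('spans', [])
                if span['start'] >= sent.start_char and span['end'] <= sent.end_char
            ]
            yield new_eg

With that logic, your example should come back with both spans ("Asp" at 0–3, "ash" at 9–12) attached to the first sentence and an empty "spans" list on the second – so the output you're seeing definitely looks like a bug.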

Thanks for including your example btw – this will make it a lot easier to try and reproduce the problem and see what’s going on.

Nice @ines, I'm glad to help – especially given how much easier my life gets with spaCy and Prodigy :slight_smile:

Please keep me updated regarding your investigation.
