prodigy.components.preprocess.split_sentences

I saw that the batch-train recipe uses the function prodigy.components.preprocess.split_sentences to split examples that consist of multiple sentences. I could not find any documentation for it, and I'm a bit confused about what it does.

When I execute the following snippet, where evals comes from a dataset that was created with ner.teach:

import copy
import prodigy.components.preprocess

item = copy.deepcopy(evals[0])
print(item)  # the original single-sentence example from the dataset
item['text'] = 'Asp - or ash? Or something completely else?'  # add a second sentence
out = list(prodigy.components.preprocess.split_sentences(nlp, [item]))
print(out)
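
For reference, nlp and evals were set up along these lines (a sketch – the dataset name is just a placeholder, and the model is assumed from the "source" in the meta):

import spacy
from prodigy.components.db import connect

nlp = spacy.load('en_core_web_lg')            # assumed model, per "core_web_lg" in the meta
evals = connect().get_dataset('my_eval_set')  # annotations collected with ner.teach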

I get the following output and I'm confused by it. If the text is split into two sentences, wouldn't the start and end values of all spans have to be adjusted as well? The second sentence now has spans attached that don't make sense for it – they still point at character offsets from the original text.

{
  "text": "Asp – or ash?",
  "spans": [
    {
      "start": 0,
      "end": 3,
      "text": "Asp",
      "rank": 0,
      "label": "DISASTER",
      "score": 0.35136231090000003,
      "source": "core_web_lg",
      "input_hash": -1988687967,
      "answer": "reject"
    },
    {
      "text": "ash",
      "start": 9,
      "end": 12,
      "label": "DISASTER",
      "priority": 0.7142857313000001,
      "score": 0.7142857313000001,
      "pattern": 51,
      "answer": "reject"
    }
  ],
  "meta": {
    "source": "The Guardian",
    "section": "Science",
    "score": 0.35136231090000003
  },
  "_input_hash": -1988687967,
  "_task_hash": 1779273069,
  "answer": "reject"
}

[
  {
    "text": "Asp - or ash?",
    "spans": [
      {
        "start": 0,
        "end": 3,
        "text": "Asp",
        "rank": 0,
        "label": "DISASTER",
        "score": 0.35136231090000003,
        "source": "core_web_lg",
        "input_hash": -1988687967,
        "answer": "reject"
      },
      {
        "text": "ash",
        "start": 9,
        "end": 12,
        "label": "DISASTER",
        "priority": 0.7142857313000001,
        "score": 0.7142857313000001,
        "pattern": 51,
        "answer": "reject"
      }
    ],
    "meta": {
      "source": "The Guardian",
      "section": "Science",
      "score": 0.35136231090000003
    },
    "_input_hash": 714256756,
    "_task_hash": -876779732,
    "answer": "reject"
  },
  {
    "text": "Or something completely else?",
    "spans": [
      {
        "start": 0,
        "end": 3,
        "text": "Asp",
        "rank": 0,
        "label": "DISASTER",
        "score": 0.35136231090000003,
        "source": "core_web_lg",
        "input_hash": -1988687967,
        "answer": "reject"
      },
      {
        "text": "ash",
        "start": 9,
        "end": 12,
        "label": "DISASTER",
        "priority": 0.7142857313000001,
        "score": 0.7142857313000001,
        "pattern": 51,
        "answer": "reject"
      }
    ],
    "meta": {
      "source": "The Guardian",
      "section": "Science",
      "score": 0.35136231090000003
    },
    "_input_hash": -1897717309,
    "_task_hash": 2241477,
    "answer": "reject"
  }
]

Thanks for the report – and I could have sworn split_sentences was in the docs, but turns out it’s not. Sorry, will fix this.

The function does pretty much exactly what you think it does – it uses spaCy to split the text into sentences, and then iterates over the spans and adjusts their start and end positions accordingly. So I'm pretty confused that it produces the result you shared. The preprocessor iterates over the sentences, deep-copies the example for each sentence, overwrites the text, collects all adjusted spans that fall within the sentence boundaries into a new list, and then overwrites the example's "spans" property with it.
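
To illustrate, here's a minimal sketch of that logic – not the actual Prodigy source, the function name and details are just illustrative:

import copy

def split_sentences_sketch(nlp, stream):
    for eg in stream:
        doc = nlp(eg['text'])
        for sent in doc.sents:
            # deep-copy the example so each sentence becomes an independent task
            new_eg = copy.deepcopy(eg)
            new_eg['text'] = sent.text
            # keep only the spans inside this sentence, shifted to sentence-relative offsets
            new_eg['spans'] = [
                dict(span,
                     start=span['start'] - sent.start_char,
                     end=span['end'] - sent.start_char)
                for span in eg.get('spans', [])
                if span['start'] >= sent.start_char and span['end'] <= sent.end_char
            ]
            yield new_eg

With that logic, your example should come back with both spans ("Asp" at 0–3, "ash" at 9–12) attached to the first sentence and an empty "spans" list on the second – so the output you're seeing definitely looks like a bug.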

Thanks for including your example btw – this will make it a lot easier to try and reproduce the problem and see what’s going on.

Nice @ines, I'm glad to help – especially given how much easier my life gets with spaCy and Prodigy :slight_smile:

Please keep me updated regarding your investigation.
