Section Heading Identification. Expert cohort sense check.

Hi @HAL9000 ,

If your sections are always in the same order, you can indeed just identify the headings (using a classifier or patterns) and then select the chunks of text in between. If you use patterns to identify the headers, you don't actually need Prodigy, as you won't be training any models. You could implement that, for example, as a spaCy pipeline with custom components.
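To make the pattern-based route concrete, here is a minimal sketch using only Python's `re` module. The heading names and the `split_into_sections` helper are made up for illustration, so swap in whatever your documents actually use. The same logic could later be wrapped in a custom spaCy component if you want it inside a pipeline.

```python
import re

# Hypothetical heading names; replace them with the headings in your documents.
HEADING_PATTERN = re.compile(
    r"^(?P<heading>Introduction|Methods|Results|Discussion)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_into_sections(text):
    """Return {heading: body}, where body is the text up to the next heading."""
    matches = list(HEADING_PATTERN.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        body_start = match.end()
        # The body runs until the next heading, or to the end of the document.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        heading = match.group("heading").title()
        sections[heading] = text[body_start:body_end].strip()
    return sections


sample = (
    "Introduction\nWe study X.\n\n"
    "Methods\nWe did Y.\nAnd Z.\n\n"
    "Results:\nIt worked.\n"
)
sections = split_into_sections(sample)
for heading, body in sections.items():
    print(heading, "->", repr(body))
```

Note that this only treats a line as a heading when the heading word stands alone on the line, which is an assumption about your layout and may need loosening.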

One step up would be to train a model to identify the headers. Depending on the quality of the data and the annotations, a model should give you better recall than patterns. For that, you could use Prodigy's spans.manual recipe to annotate the headers. Once the headers are identified, you could again just select the text in between as a postprocessing step.
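The postprocessing step could look roughly like the sketch below. It assumes the predicted headers arrive as character-offset spans with `start`, `end` and `label` keys (the shape Prodigy uses for span annotations); the `sections_from_header_spans` helper and the sample predictions are hypothetical stand-ins, not output from a real model.

```python
def sections_from_header_spans(text, header_spans):
    """Slice out the text that follows each header span, up to the next header."""
    ordered = sorted(header_spans, key=lambda span: span["start"])
    sections = []
    for i, span in enumerate(ordered):
        # The body ends where the next header starts, or at the end of the text.
        body_end = ordered[i + 1]["start"] if i + 1 < len(ordered) else len(text)
        sections.append(
            {
                "header": text[span["start"]:span["end"]],
                "label": span["label"],
                "body": text[span["end"]:body_end].strip(),
            }
        )
    return sections


doc_text = "1. Background\nSome context.\n2. Approach\nWhat we did."
# Hand-written stand-ins for what a trained header model might predict.
predicted = [
    {"start": 0, "end": 13, "label": "HEADER"},
    {"start": 28, "end": 39, "label": "HEADER"},
]
result = sections_from_header_spans(doc_text, predicted)
for section in result:
    print(section["header"], "->", repr(section["body"]))
```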

If you would like to do the extraction in one step, you might train a text classifier to assign each sentence, or even each paragraph in this case, to the section it belongs to. For that you could use Prodigy textcat with multiple options (one option per section). Alternatively, you could use binary textcat, where you make multiple (but really speedy) passes over the data, each time answering whether a given paragraph belongs to the section in question.
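For paragraph-level classification, the data first has to be split into one annotation task per paragraph. Below is a rough sketch that assumes paragraphs are separated by blank lines and that you feed Prodigy a JSONL file with one `{"text": ...}` object per line. The `paragraphs_to_tasks` helper, the `doc_id` value and the `meta` fields are illustrative choices, not something your setup requires.

```python
import json


def paragraphs_to_tasks(text, doc_id):
    """Turn a document into one task dict per non-empty paragraph."""
    tasks = []
    for block in text.split("\n\n"):
        # Collapse internal line breaks and repeated whitespace.
        paragraph = " ".join(block.split())
        if paragraph:
            tasks.append(
                {
                    "text": paragraph,
                    "meta": {"doc_id": doc_id, "paragraph": len(tasks)},
                }
            )
    return tasks


def to_jsonl(tasks):
    """Serialize tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


sample_doc = "Intro text here.\n\n\n\nMethods text\ncontinues here.\n"
tasks = paragraphs_to_tasks(sample_doc, doc_id="doc-1")
print(to_jsonl(tasks))
```

Keeping the paragraph index in `meta` makes it easier to stitch the classified paragraphs back into sections afterwards.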

This post discusses a problem very similar to yours, in case you want to have a look: what is best way to to extract paragraph or long sentences in a text document?