Section Heading Identification. Expert cohort sense check.

Can I have a sense check from the cohort of experts here that I am approaching this the correct way?

I have a series of reports which have a similar general structure with headings:

Heading Type1:Introduction (+synonyms/variants)
Free Text Blah blah blah blah blah.

Heading Type2:Findings (+synonyms/variants)
Free Text Blah blah blah blah blah.

Heading Type3:Results (+synonyms/variants)
Free Text Blah blah blah blah blah.

Heading Type4:Conclusion (+synonyms/variants)
Free Text Blah blah blah blah blah.

I want to extract the free text below each heading.

My plan was to use Prodigy with some of the reports and label all the variants of the above headings into the same label group, i.e. Heading Type4: Conclusion would include Impression, Final Opinion, etc., plus all the typographical variations.

Having grouped the sections, I could then extract the free text between the sections from the positions of the labels. Is this the smart way to do this, or should I make the text below a child of the section? (Not sure how to do that simply, to be honest.)

The question I am asking is: is this a sensible way of doing it? Is there a better way? Feel free to tell me this is daft or to propose a smarter approach. Any examples would be appreciated.



Hi @HAL9000 ,

If your sections are always in the same order you can, indeed, just identify the headings (using a classifier or patterns) and then select the chunks of text in between. If you are using patterns to identify headers, you don't actually need Prodigy (as you won't be training any models). You might be able to implement that, for example, as a spaCy pipeline with custom components.
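To make the pattern-based route concrete, here is a minimal sketch in plain Python (plain regex rather than a full spaCy pipeline, and the heading synonyms are placeholder assumptions, so you would substitute your real variants):

```python
import re

# Hypothetical heading synonyms; the real lists would come from your reports.
HEADING_PATTERNS = {
    "INTRODUCTION": r"(?:Introduction|Background)",
    "FINDINGS": r"(?:Findings|Observations)",
    "RESULTS": r"(?:Results)",
    "CONCLUSION": r"(?:Conclusion|Impression|Final Opinion)",
}

def split_sections(text):
    """Return {label: free_text} by slicing the text between heading matches."""
    # Find every heading occurrence with its character position.
    hits = []
    for label, pat in HEADING_PATTERNS.items():
        for m in re.finditer(rf"^\s*{pat}\s*:?", text, flags=re.I | re.M):
            hits.append((m.start(), m.end(), label))
    hits.sort()
    # Each section's free text runs up to the next heading (or end of file).
    sections = {}
    for i, (start, end, label) in enumerate(hits):
        stop = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections[label] = text[end:stop].strip()
    return sections

report = """Introduction: blah one.
Findings: blah two.
Conclusion: blah three."""
print(split_sections(report))
```

The same idea carries over to a spaCy custom component: match the headings first, then slice out the stretches of text between consecutive matches.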

One step up would be to train a model to identify the headers. Depending on the quality of the data and the annotations, the model should give you better recall than patterns. For that you could use Prodigy's spans.manual recipe to annotate the headers. Once the headers are identified, again, you could just select the text in between as a post-processing step.
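As a sketch of that post-processing step, assuming Prodigy-style span annotations with character offsets (the offsets, labels, and example text below are invented purely for illustration):

```python
# Hypothetical task in the shape produced by span annotation: a "text" field
# plus "spans", each with character offsets and a heading label.
task = {
    "text": "Introduction: blah one. Findings: blah two.",
    "spans": [
        {"start": 0, "end": 13, "label": "INTRODUCTION"},
        {"start": 24, "end": 33, "label": "FINDINGS"},
    ],
}

def extract_bodies(task):
    """Free text between consecutive heading spans (last span runs to EOF)."""
    spans = sorted(task["spans"], key=lambda s: s["start"])
    bodies = {}
    for i, span in enumerate(spans):
        stop = spans[i + 1]["start"] if i + 1 < len(spans) else len(task["text"])
        bodies[span["label"]] = task["text"][span["end"]:stop].strip()
    return bodies

print(extract_bodies(task))
```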

If you would like to do the extraction in one step, you might train a text classifier to assign each sentence (or, in this case, each paragraph) to the section it belongs to. For that you could use Prodigy textcat with multiple options (one option per section), or binary textcat, where you would make multiple (but really speedy) passes over the data, each time asking whether a given paragraph belongs to the section in question.
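A rough sketch of how paragraph-level tasks for the multiple-option textcat UI might be prepared (splitting on blank lines is an assumption here; adjust it to however your reports actually delimit paragraphs):

```python
import json

# Hypothetical section labels; one option per section.
LABELS = ["INTRODUCTION", "FINDINGS", "RESULTS", "CONCLUSION"]

def paragraphs_to_tasks(report_text):
    """One task per paragraph, with the section labels as choice options."""
    tasks = []
    for para in report_text.split("\n\n"):
        para = para.strip()
        if para:
            tasks.append({
                "text": para,
                "options": [{"id": lab, "text": lab} for lab in LABELS],
            })
    return tasks

tasks = paragraphs_to_tasks("Intro text here.\n\nFindings text here.")
print(json.dumps(tasks[0]))
```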

This post discusses a problem very similar to yours in case you want to have a look: what is best way to to extract paragraph or long sentences in a text document?

Thanks so much. Will have a look at the example you sent.

This is a rough structure of the reports I want to analyse. The colour coding is consistent across the examples. You have suggested two options.

The first was spans.manual to annotate the headers, then find the text in between as a post-processing step. I would guess I would need some rules here for cases like the example below, which has a missing conclusion section.

The second method you suggested was textcat. From what I have seen, textcat would pull random sections/sentences/paragraphs in isolation and ask me to label them. This would be difficult without their position shown in the full text for context. Is this what happens? I would be happy, as with NER/SpanCat, to just label the entire document.

What do you think from the examples I have provided? I hope this makes sense.

What do you think, @magdaaniol?

Hi @HAL9000 ,

Given that the order of the sections is not always the same, plus there's variation in how they are named, relying on rules might be error-prone. Conversely, the language of each section should be significantly different from the others, which means that text classification should probably work well.

The thing is that NER and spancat are meant for shorter spans, not long and complex ones, let alone entire paragraphs. For longer text spans, text classification tends to yield better results.

I think it would be best to preprocess the texts so that you have one paragraph per Prodigy example (this should give enough context to both the annotator and the model). To facilitate the task, you could add the name of the section to the meta attribute of each task (if it's easy to extract automatically, e.g. based on the number of newlines before it or other cues) so that it is displayed in the lower-right corner.
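As a sketch of that preprocessing, assuming the first word of a paragraph is a usable cue (the cue list below is hypothetical; you would extend it with your real heading synonyms):

```python
import re

# Hypothetical cue words: a paragraph starting with a known heading word
# probably belongs to that section; other paragraphs inherit the last guess.
HEADING_CUES = {
    "introduction": "INTRODUCTION",
    "findings": "FINDINGS",
    "results": "RESULTS",
    "conclusion": "CONCLUSION",
    "impression": "CONCLUSION",
}

def tasks_with_meta(report_text):
    tasks, current = [], None
    for para in report_text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        first_word = re.split(r"\W+", para.lower(), maxsplit=1)[0]
        current = HEADING_CUES.get(first_word, current)
        # The "meta" field is shown in the lower-right corner of the UI.
        tasks.append({"text": para, "meta": {"section": current}})
    return tasks

for t in tasks_with_meta("Findings: blah.\n\nMore detail."):
    print(t)
```

Since the guess is only displayed as metadata, a wrong guess costs nothing: the annotator still picks the actual label.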