Section Heading Identification. Expert cohort sense check.

magdaaniol · February 7, 2024, 4:05pm

If your sections are always in the same order you can, indeed, just identify the headings (using a classifier or patterns) and then select the chunks of text in between. If you are using patterns to identify the headers, then you don't actually need Prodigy (as you won't be training any models). You might be able to implement that, for example, as a spaCy pipeline with custom components.
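As a rough illustration of the pattern-based route, here is a minimal sketch in plain Python that finds heading lines with a regular expression and takes the text between consecutive headings. The heading names and the `split_by_headings` helper are made up for the example; the same logic could be moved into a custom spaCy component.

```python
import re

# Hypothetical heading names; replace them with the headings in your documents.
HEADING_PATTERN = re.compile(
    r"^(summary|background|methods|results|conclusion)[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_by_headings(text):
    """Return a dict mapping each matched heading to the text that follows it."""
    matches = list(HEADING_PATTERN.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        body_start = match.end()
        # The body runs until the next heading, or to the end of the document.
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[body_start:body_end].strip()
    return sections


doc = "Summary\nShort overview.\n\nMethods:\nWe did X.\nThen Y.\n\nResults\nIt worked."
sections = split_by_headings(doc)
for heading, body in sections.items():
    print(f"{heading!r}: {body!r}")
```

This only works when a heading sits on a line of its own and uses one of the listed names, which is exactly the situation where patterns are enough.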

One step up would be to train a model to identify the headers. Depending on the quality of the data and the annotations, the model should give you better recall than patterns. For that you could use Prodigy's spans.manual recipe to annotate the headers. Once the headers are identified, again, you could just select the text in between as a postprocessing step.
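The postprocessing step could look roughly like the sketch below. It assumes the model's predictions have already been turned into character offsets of the headers (with spaCy, something like the `start_char` and `end_char` of each predicted span); the `sections_from_header_spans` helper and the sample text are invented for illustration.

```python
def sections_from_header_spans(text, header_spans):
    """Pair each header with the text up to the next header.

    header_spans is a list of (start, end) character offsets, in any order.
    """
    ordered = sorted(header_spans)
    result = []
    for i, (start, end) in enumerate(ordered):
        # The body ends where the next header starts, or at the end of the text.
        body_end = ordered[i + 1][0] if i + 1 < len(ordered) else len(text)
        result.append((text[start:end], text[end:body_end].strip()))
    return result


text = "Education BSc in Physics. Experience Two years as an analyst."
# Stand-in for model predictions: offsets are looked up here only for the demo.
headers = ["Experience", "Education"]
spans = [(text.index(h), text.index(h) + len(h)) for h in headers]

pairs = sections_from_header_spans(text, spans)
for header, body in pairs:
    print(f"{header}: {body}")
```

Because it works on offsets rather than lines, this also covers headers that appear inline with the text.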

If you would like to do the extraction in one step, you could train a text classifier to assign each sentence, or in this case even each paragraph, to the section it belongs to. For that you could use Prodigy textcat with multiple options (one option per section), or binary textcat, where you would make multiple (but really speedy) passes over the data, each time asking whether a given paragraph belongs to the section in question.
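To annotate at the paragraph level you first need one example per paragraph. A minimal sketch of that preparation step, assuming paragraphs are separated by blank lines and that you want to keep track of where each paragraph came from (the `paragraphs_to_tasks` helper and the `doc_id` and `paragraph` meta keys are just illustrative names):

```python
import json
import re


def paragraphs_to_tasks(text, doc_id):
    """Split on blank lines and build one task dict per non-empty paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        {"text": paragraph, "meta": {"doc_id": doc_id, "paragraph": i}}
        for i, paragraph in enumerate(paragraphs)
    ]


raw = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
tasks = paragraphs_to_tasks(raw, "doc-1")

# One JSON object per line, i.e. the JSONL layout used for annotation input.
jsonl = "\n".join(json.dumps(task) for task in tasks)
print(jsonl)
```

Keeping the paragraph index in the meta makes it easy to stitch the predicted labels back into sections afterwards.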

This post discusses a problem very similar to yours in case you want to have a look: what is best way to to extract paragraph or long sentences in a text document?
