Segmenting text into blocks containing similar content

Can someone recommend a way to segment a document containing survey questions into the individual questions? Simple regex matching is not robust enough for the various styles and formats I encounter. I've been wondering about using NER to detect the question number entity (the format can be alphanumeric or numeric), but I'm open to any other approach. For example, consider the following document:

Please complete this survey accurately.

Demographic Questions
=SEG=Q1)	What is your gender? 
	a)	Male
	b)	Female 
	c)	Prefer not to say

=SEG=Q2)	What is your age? [TEXT BOX, only allow whole numbers between 1-99]
Hidden Punch:
1)	Less than 18 years old
2)	18 to 24 years old
3)	25 to 34 years old
4)	35 to 44 years old
5)	45 to 54 years old
6)	55 to 64 years old
7)	65+

=SEG=[Show the following question if respondent is in the USA]
S3)	What is your zip code? [TEXT BOX, validate: 5 numbers]
a)	Zip Code

To illustrate, I've added =SEG= markers to the text above at the points where I would like a segmenter to split it.


Hi! Do you have some other examples of the types of texts you're dealing with? Because in this particular example (and other similar ones), a regex or similar rule-based match pattern seems to be a viable and probably the most efficient solution. Teaching an entity recognizer or similar span prediction model to detect the question numbers is possible, even if they all have slightly different formats, but it seems unlikely that it's going to beat a more systematic, rule-based approach. Even if you got to a really high accuracy of like 90%, it'd still mean that on average every 10th prediction is wrong, and it could potentially introduce another layer of complexity with false positives and negatives that are a lot harder to reason about or prevent.
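
Just to make the rule-based idea concrete, here's a minimal sketch for the example above. The pattern and the function name are my own assumptions, not something taken from your data: it treats one to three letters followed by digits and a closing parenthesis at the start of a line (like Q1) or S3)) as a question label, and it pulls any bracketed instruction line directly above a label into the same segment. Purely numeric question numbers would need an extra rule, since they look the same as the numbered answer options.

```python
import re

# Assumed label format: 1-3 letters, digits, then ")" at the start of a line.
# Answer options such as "a)" or "1)" deliberately do not match.
LABEL = re.compile(r"^\s*[A-Za-z]{1,3}\d+\)")

# Assumed format for routing instructions: a line that is entirely "[...]".
INSTRUCTION = re.compile(r"^\s*\[.*\]\s*$")


def segment_questions(text):
    """Split survey text into one string per question (hypothetical helper)."""
    lines = text.splitlines()
    starts = []
    for i, line in enumerate(lines):
        if LABEL.match(line):
            start = i
            # Include bracketed instruction lines sitting directly above the label.
            while start > 0 and INSTRUCTION.match(lines[start - 1]):
                start -= 1
            starts.append(start)

    # Each segment runs from its start line up to the next segment's start.
    ends = starts[1:] + [len(lines)]
    return ["\n".join(lines[b:e]).strip() for b, e in zip(starts, ends)]
```

Anything before the first label (the intro sentence and the section heading) is dropped here. The nice part of keeping it rule-based is that when a new format breaks it, you can see exactly which rule failed and add another one.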

Hi Ines! Thanks, I appreciate you providing some insight. It really is in the narrow domain of questionnaires and I do have more examples. I probably just have to think harder about my rules!

Perhaps a conversation for a different thread: I've generally been thinking about the 'problem' of text segmentation as well, for example segmenting an email into its various parts (greeting, message, request, sign-off, disclaimer). Perhaps in some cases it just comes down to text categorisation over the sentences and finding a 'boundary' between the different categories.
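
As a rough sketch of what I mean by finding the boundary: assuming some text classifier has already assigned one category to each sentence (the classifier itself is not shown, and the function name and labels below are made up for illustration), consecutive sentences with the same category can be merged into a segment, and a boundary falls wherever the category changes.

```python
from itertools import groupby


def segments_from_labels(sentences, labels):
    """Merge runs of consecutive sentences that share a label.

    `labels` is assumed to come from a per-sentence text classifier;
    returns a list of (label, text) tuples in document order.
    """
    if len(sentences) != len(labels):
        raise ValueError("need exactly one label per sentence")

    segments = []
    # groupby only groups adjacent items, so a label that reappears
    # later in the email starts a new segment instead of being merged.
    for label, group in groupby(zip(labels, sentences), key=lambda pair: pair[0]):
        segments.append((label, " ".join(sentence for _, sentence in group)))
    return segments
```

A single misclassified sentence in the middle of a run would split that run into three segments, so in practice some smoothing over neighbouring predictions is probably needed.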

Yes, agree, that definitely sounds very reasonable for use cases like that :+1: