Can someone recommend a way to segment a document containing survey questions into the individual questions? Simple regex matching is not robust enough for the various styles and formats I encounter. I've been wondering about using NER to detect the question-number entity (the format can be alphanumeric or numeric), but I'm open to any other approach. For example, consider the following document:
Please complete this survey accurately. Demographic Questions =SEG=Q1) What is your gender? a) Male b) Female c) Prefer not to say =SEG=Q2) What is your age? [TEXT BOX, only allow whole numbers between 1-99] Hidden Punch: 1) Less than 18 years old 2) 18 to 24 years old 3) 25 to 34 years old 4) 35 to 44 years old 5) 45 to 54 years old 6) 55 to 64 years old 7) 65+ =SEG=[Show the following question if respondent is in the USA] S3) What is your zip code? [TEXT BOX, validate: 5 numbers] a) Zip Code
To illustrate, I've added =SEG= markers to the text above at the points where I would like a segmenter to split it.
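In case it helps frame an answer, here is a minimal sketch of how text annotated with =SEG= could be turned into gold segments and token-level boundary labels for training a sequence tagger. It assumes plain whitespace tokenization; the function names and the `B-Q` tag are hypothetical placeholders for illustration, not part of any library.

```python
SEG = "=SEG="

def split_gold_segments(annotated: str) -> list[str]:
    """Split annotated text on the =SEG= marker, dropping empty pieces."""
    return [part.strip() for part in annotated.split(SEG) if part.strip()]

def boundary_labels(annotated: str) -> list[tuple[str, str]]:
    """Whitespace-tokenize the text and tag the first token after each
    =SEG= marker as B-Q (hypothetical tag name); all other tokens get O."""
    labeled = []
    for i, segment in enumerate(annotated.split(SEG)):
        for j, token in enumerate(segment.split()):
            # Text before the first marker (i == 0) is preamble, so it
            # never starts a question segment.
            tag = "B-Q" if i > 0 and j == 0 else "O"
            labeled.append((token, tag))
    return labeled

# Shortened version of the example document, for illustration only.
sample = (
    "Demographic Questions =SEG=Q1) What is your gender? a) Male "
    "=SEG=[Show if in the USA] S3) What is your zip code?"
)

for segment in split_gold_segments(sample):
    print(segment)
```

Note that under this labeling the boundary for the last segment falls on "[Show", not on "S3)", which is one reason detecting only the question number would not reproduce the segmentation I am after.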