Segmenting text into blocks containing similar content

mneethling · June 3, 2021, 8:55am

Can someone recommend a way to segment a document containing survey questions into the individual questions? Simple regex matching is not robust enough for the various styles and formats I encounter. I've been wondering about using NER to detect the question number entity (the format can be alphanumeric or numeric), but I'm open to any other approach. For example, consider the following document:

Please complete this survey accurately.

Demographic Questions
=SEG=Q1)	What is your gender? 
	a)	Male
	b)	Female 
	c)	Prefer not to say

=SEG=Q2)	What is your age? [TEXT BOX, only allow whole numbers between 1-99]
Hidden Punch:
1)	Less than 18 years old
2)	18 to 24 years old
3)	25 to 34 years old
4)	35 to 44 years old
5)	45 to 54 years old
6)	55 to 64 years old
7)	65+

=SEG=[Show the following question if respondent is in the USA]
S3)	What is your zip code? [TEXT BOX, validate: 5 numbers]
a)	Zip Code

To illustrate, I've added =SEG= markers to the text above at the points where I would like a segmenter to split it.

Thanks

ines · June 3, 2021, 12:10pm

Hi! Do you have some other examples of the types of texts you're dealing with? Because in this particular example (and other similar ones), a regex or similar rule-based match pattern seems to be a viable and probably the most efficient solution. Even if the question numbers all have slightly different formats, teaching an entity recognizer or similar span prediction model to detect them is possible, but it seems unlikely that it's going to beat a more systematic, rule-based approach? Even if you got to a really high accuracy of like 90%, it'd still mean that every 10th prediction is wrong, and it could potentially introduce another layer of complexity with false positives and negatives that are a lot harder to reason about or prevent.
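As a rough illustration of the rule-based route for the example document, here is a minimal sketch in Python. The patterns and the `segment_questions` helper are hypothetical and only cover what the sample shows: question labels are assumed to be one or two letters followed by digits and `)` or `.` (so `Q1)` and `S3)` match, while answer options like `1)` or `a)` do not), and a line that consists only of a bracketed instruction is assumed to belong to the question that follows it. Purely numeric question labels would need an extra rule to tell them apart from answer options.

```python
import re

# Assumed label format: 1-2 letters, digits, then ")" or "." (e.g. "Q1)", "S3)").
QUESTION_LABEL = re.compile(r"^\s*[A-Za-z]{1,2}\d+[).]\s")
# Assumed instruction format: a line made up only of "[...]".
INSTRUCTION = re.compile(r"^\s*\[.*\]\s*$")


def segment_questions(text):
    """Split text into blocks: any preamble first, then one block per question."""
    segments, current = [], []
    after_instruction = False  # True while an instruction awaits its question

    def flush():
        # Store the block collected so far, skipping blocks that are only blank.
        block = "\n".join(current).strip()
        if block:
            segments.append(block)
        current.clear()

    for line in text.splitlines():
        if INSTRUCTION.match(line):
            # A standalone instruction opens the segment of the next question.
            flush()
            after_instruction = True
        elif QUESTION_LABEL.match(line):
            # Start a new segment unless an instruction already opened it.
            if not after_instruction:
                flush()
            after_instruction = False
        elif line.strip():
            # Any other non-blank line cancels a pending instruction.
            after_instruction = False
        current.append(line)

    flush()
    return segments
```

Run on the sample document, this should give four blocks: the preamble, Q1, Q2, and the instruction together with S3. Every new format then becomes one more explicit rule, which is easier to reason about than a model's false positives.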

mneethling · June 7, 2021, 8:16am

Hi Ines! Thanks, I appreciate you providing some insight. It really is in the narrow domain of questionnaires and I do have more examples. I probably just have to think harder about my rules!

Perhaps a conversation for a different thread: I've also been thinking about the 'problem' of text segmentation more generally, for example segmenting an email into its parts (greeting, message, request, sign-off, disclaimer). Perhaps in some cases it just comes down to text categorisation over the sentences and finding the 'boundary' between the different categories.
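To make the boundary idea concrete, here is a minimal sketch in Python. It assumes each sentence already has a category label, however that label is produced (a text classifier, rules, or manual annotation); the `segment_by_label` helper and the label names are hypothetical. A boundary is simply any point where the label changes from one sentence to the next.

```python
from itertools import groupby


def segment_by_label(sentences, labels):
    """Merge consecutive sentences that share a label into (label, text) segments."""
    if len(sentences) != len(labels):
        raise ValueError("sentences and labels must have the same length")
    segments = []
    # groupby only groups adjacent items, so each run of equal labels is one segment.
    for label, run in groupby(zip(labels, sentences), key=lambda pair: pair[0]):
        segments.append((label, " ".join(sentence for _, sentence in run)))
    return segments


# Hypothetical email, with labels as a per-sentence classifier might predict them.
sentences = [
    "Hi Sam,",
    "The report is attached.",
    "It covers the second quarter.",
    "Could you review it by Friday?",
    "Thanks, Alex",
]
labels = ["greeting", "message", "message", "request", "sign_off"]

for label, text in segment_by_label(sentences, labels):
    print(f"{label}: {text}")
```

A single misclassified sentence in the middle of a run would split that segment in two, so in practice some smoothing over neighbouring labels is probably needed.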

ines · June 8, 2021, 12:57am

Yes, agreed, that definitely sounds very reasonable for use cases like that.
