Segmenting list items in OCR output

My project involves lists of materials. The list items aren't sentences, and they roughly have a format but it's not consistent enough to parse with a regular expression. For example:

6 colored pencils (any color)
1 meter of string
A cup of water, just off the boil
Decoration of your choice

I eventually want to do NER on the text (e.g. number, material, unit, notes). My issue is that I'm getting the lists from OCR of printed pages, and they're often printed in narrow sidebars where line breaks are used for formatting. The actual output looks like:

1 meter of string\nA cup of water, just\noff the boil\n

In order to recover the list items I'd need to know that newlines like the one between "just" and "off" are not list item boundaries. One thought I had is to split the OCR'd text into Prodigy tasks at all newlines, and classify them as "end of item" or "not end of item". Curious if that seems like the right approach, or, say, if somehow providing more context on either side of the newlines would be better? Part of my issue is that I'm not sure what this task is called so I've had trouble googling for best practices. Thanks a lot for any thoughts!

I think classifying the newlines first does seem smart. Hopefully you can train this relatively easily, and then save out a dataset with the newlines corrected.
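To make that concrete, here is a rough sketch in plain Python of turning each newline into one annotation task, with the line on either side as context, and then rejoining the lines once you have a decision per newline. The function names, the " ¶ " marker and the "break_index" key are made up for illustration. The only thing assumed about Prodigy is that a task is a dict with a "text" key and an optional "meta" dict.

```python
def newline_tasks(ocr_text, marker=" ¶ "):
    """Build one task per newline, showing the line before and after it."""
    lines = [line.strip() for line in ocr_text.split("\n") if line.strip()]
    tasks = []
    for i in range(len(lines) - 1):
        tasks.append({
            # The marker (an arbitrary choice) shows where the newline was.
            "text": lines[i] + marker + lines[i + 1],
            # Hypothetical key: remembers which newline the task refers to.
            "meta": {"break_index": i},
        })
    return lines, tasks


def merge_lines(lines, is_boundary):
    """Rejoin lines; is_boundary[i] says whether newline i ends an item."""
    if not lines:
        return []
    items = [lines[0]]
    for line, boundary in zip(lines[1:], is_boundary):
        if boundary:
            items.append(line)        # start a new list item
        else:
            items[-1] += " " + line   # continuation of the previous item
    return items


ocr = "1 meter of string\nA cup of water, just\noff the boil\n"
lines, tasks = newline_tasks(ocr)
for task in tasks:
    print(task["text"])
print(merge_lines(lines, [True, False]))
```

Showing both neighbouring lines in each task is one way to give the annotator (and later the model) context on either side of the newline. You could widen the window if a single line turns out not to be enough.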

I wonder whether a rule-based approach might be okay enough for the newline classification task. You should at least be able to use some rules that reliably match one class or the other: for instance, if you have a number after the newline, is that almost always a divider? If so, that will help you bootstrap your classification, as that's more examples you don't need to annotate.
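As a minimal sketch of that bootstrapping idea: a function that pre-labels a newline from the line that follows it, and returns None when no rule fires so the example still goes to manual annotation. The label names are placeholders, and the lowercase rule is my own assumption about your data rather than something I know holds, so check both rules against a sample first.

```python
import re


def prelabel_break(next_line):
    """Guess a label for a newline from the line that follows it."""
    next_line = next_line.strip()
    if re.match(r"\d", next_line):
        # Rule from above: a number after the newline suggests a new item.
        return "END_OF_ITEM"
    if re.match(r"[a-z]", next_line):
        # Assumed rule: a lowercase start suggests a continuation line.
        return "NOT_END_OF_ITEM"
    # No rule fired: leave this one for manual annotation.
    return None


for line in ["6 colored pencils (any color)", "off the boil", "Decoration of your choice"]:
    print(repr(line), "->", prelabel_break(line))
```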

spaCy's Matcher patterns might be helpful in developing these rules, especially being able to match on token attributes like IS_DIGIT or SHAPE. The online rule-builder might be helpful too.
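Here is roughly what such Matcher rules could look like. This is a sketch, not a tested recipe: it uses the spaCy v3 form matcher.add(name, [pattern]) (on older versions the call is matcher.add(name, None, pattern)), a blank English pipeline so no model download is needed, and it relies on the tokenizer keeping "\n" as its own token. The rule names are made up, and the IS_LOWER rule is an assumption on my part.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no trained model required
matcher = Matcher(nlp.vocab)

# A newline token followed by a digit token: likely the start of a new item.
matcher.add("BOUNDARY", [[{"ORTH": "\n"}, {"IS_DIGIT": True}]])
# Assumed rule: a newline followed by a lowercase token is a continuation.
matcher.add("CONTINUATION", [[{"ORTH": "\n"}, {"IS_LOWER": True}]])

text = "1 meter of string\nA cup of water, just\noff the boil\n6 colored pencils"
doc = nlp(text)

# Collect (token index of the newline, rule name, token after the newline).
found = sorted(
    (start, nlp.vocab.strings[match_id], doc[start + 1].text)
    for match_id, start, end in matcher(doc)
)
for start, label, next_token in found:
    print(start, label, repr(next_token))
```

Newlines that neither rule matches (here the one before "A cup") are the ones you would still need to annotate by hand.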

One awkward thing that might interfere, however, is that v2.1 of spaCy has a hard-coded constraint that prevents whitespace tokens from being entities, as this prevents problems in noisy data in most cases. You might need to replace the newlines with some symbol sequence, e.g. -@-, to get the NER model to work.
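The replacement itself is a one-liner. Here is a small sketch with a matching function to put the newlines back afterwards. The function names are made up, and padding the symbol with spaces is my assumption to keep it apart from neighbouring words. Note that this shifts character offsets, and it's worth checking how your tokenizer actually splits the placeholder before training on it.

```python
PLACEHOLDER = "-@-"  # the symbol sequence suggested above; any unused string works


def mask_newlines(text):
    """Replace each newline with a visible, space-padded placeholder."""
    return text.replace("\n", f" {PLACEHOLDER} ")


def unmask_newlines(text):
    """Undo mask_newlines, restoring the original newlines."""
    return text.replace(f" {PLACEHOLDER} ", "\n")


ocr = "1 meter of string\nA cup of water, just\noff the boil\n"
masked = mask_newlines(ocr)
print(masked)
print(unmask_newlines(masked) == ocr)
```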

Thank you for the reply, I'll give it a shot. This forum is such a great resource for Prodigy/spaCy and practical NLP advice. I really appreciate it.
