Segmenting list items in OCR output

thebenedict · October 15, 2019, 4:08pm

My project involves lists of materials. The list items aren't sentences, and they roughly have a format but it's not consistent enough to parse with a regular expression. For example:

6 colored pencils (any color)
1 meter of string
A cup of water, just off the boil
Scissors
Decoration of your choice

I eventually want to do NER on the text (e.g. number, material, unit, notes). My issue is that I'm getting the lists from OCR of printed pages, and they're often printed in narrow sidebars where line breaks are used for formatting. The actual output looks like:

1 meter of string\nA cup of water, just\noff the boil\n

In order to recover the list items I'd need to know that newlines like the one between "just" and "off" are not list item boundaries. One thought I had is to split the OCR'd text into Prodigy tasks at all newlines, and classify them as "end of item" or "not end of item". Curious if that seems like the right approach, or, say, if somehow providing more context on either side of the newlines would be better? Part of my issue is that I'm not sure what this task is called so I've had trouble googling for best practices. Thanks a lot for any thoughts!
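For illustration, the "one task per newline" idea could be sketched roughly as below. This is only a sketch under assumptions: the function name `newline_tasks`, the `window` size, the `<NL>` marker and the `offset`/`left`/`right` keys are all hypothetical, and only the `"text"` key follows the usual shape of a Prodigy task.

```python
def newline_tasks(text, window=30):
    """Build one candidate task per newline, with context on both sides.

    `window` (characters of context per side) and the "<NL>" marker are
    hypothetical choices, not anything Prodigy requires.
    """
    tasks = []
    for i, ch in enumerate(text):
        if ch != "\n":
            continue
        left = text[max(0, i - window):i]
        right = text[i + 1:i + 1 + window]
        tasks.append({
            "text": left + " <NL> " + right,  # what the annotator would see
            "offset": i,                      # position of the newline in the OCR text
            "left": left,
            "right": right,
        })
    return tasks


ocr = "1 meter of string\nA cup of water, just\noff the boil\n"
tasks = newline_tasks(ocr)
```

Each task can then be accepted ("end of item") or rejected ("not end of item"), and the stored offset makes it possible to rewrite the original text afterwards.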

honnibal · October 16, 2019, 11:07am

I think classifying the newlines first does seem smart. Hopefully you can train this relatively easily, and then save out a dataset with the newlines corrected.

I wonder whether a rule-based approach might be okay enough for the newline classification task. You should at least be able to use some rules that reliably match one class or the other: for instance, if you have a number after the newline, is that almost always a divider? If so, that will help you bootstrap your classification, as that's more examples you don't need to annotate.
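A rough sketch of that kind of bootstrapping, in plain Python: the "number after the newline" rule is the one suggested above, while the lowercase-continuation rule, the label strings and the function name `prelabel` are assumptions added purely for illustration. Anything the rules don't cover is left as `None` for manual annotation.

```python
def prelabel(text):
    """Pre-label newlines with simple rules; None means "needs annotation"."""
    lines = text.split("\n")
    results = []
    for prev, nxt in zip(lines, lines[1:]):
        if not nxt:
            continue  # nothing follows this newline (e.g. trailing newline)
        if nxt[0].isdigit():
            label = "end of item"      # a number after the newline: likely a new item
        elif nxt[0].islower():
            label = "not end of item"  # assumption: lowercase start continues the item
        else:
            label = None               # no rule fired, leave for the annotator
        results.append((prev, nxt, label))
    return results


sample = "Scissors\n1 meter of string\nA cup of water, just\noff the boil\n"
labels = [label for _, _, label in prelabel(sample)]
```

On real data it would be worth checking how often each rule is wrong before trusting its labels.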

spaCy's Matcher patterns might be helpful in developing these rules, especially since they can match on token attributes like IS_DIGIT or SHAPE. The online rule-builder might be helpful too: https://explosion.ai/demos/matcher

One awkward thing that might interfere, however, is that v2.1 of spaCy has a hard-coded constraint that prevents whitespace tokens from being entities, as this avoids problems with noisy data in most cases. You might need to replace the newlines with some symbol sequence, e.g. -@-, to get the NER model to work.
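The substitution itself could look something like this minimal sketch. The helper names are hypothetical, and padding the placeholder with spaces is an assumption made so that it is separated from the neighbouring words; whether the tokenizer keeps -@- as a single token is worth checking on your own data.

```python
PLACEHOLDER = " -@- "  # symbol sequence from the suggestion above, padded with spaces


def mark_newlines(text):
    """Replace each newline with a visible placeholder before running NER."""
    return text.replace("\n", PLACEHOLDER)


def restore_newlines(text):
    """Undo mark_newlines; assumes the placeholder never occurs in the OCR text."""
    return text.replace(PLACEHOLDER, "\n")


ocr = "1 meter of string\nA cup of water, just\noff the boil\n"
marked = mark_newlines(ocr)
```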

thebenedict · October 16, 2019, 3:06pm

Thank you for the reply, I'll give it a shot. This forum is such a great resource for Prodigy/spaCy and practical NLP advice. I really appreciate it.
