Fixing wrong whitespaces - modelling question

nix411 · June 2, 2020, 2:17pm

I have some digital PDFs from which I extract tables into pandas DataFrames. It works quite well, but sometimes there are whitespaces that shouldn't be there, e.g.

Cos t of s a l es
Res ea rch a nd devel opment cos ts

Since I'm able to work out where the whitespace should be, maybe I can create a spaCy component that can do so as well, but I'm not sure how. I could make it a textcat task for each possible merge of tokens, but it really depends on the whole phrase, not just the merged tokens. I'm not sure if this forum is the right place to ask, though. Would Stack Overflow be a better choice?

honnibal · June 3, 2020, 12:11pm

I think a character-based language model to predict the spacing would probably work pretty well. You would train the model on normal text, then remove those spaces in your documents to which the language model assigns a low probability.
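
To make the idea concrete, here is a minimal sketch using a tiny character n-gram count model rather than a real neural language model. Everything in it is an assumption for illustration: the names `train_char_model`, `space_probability` and `remove_unlikely_spaces`, the context length of 3, the threshold of 0.1, and the toy training text. It also only looks at the characters to the left of each space, which is cruder than what a proper model would do.

```python
from collections import Counter, defaultdict

ORDER = 3  # number of preceding characters used as context (an assumption)

# Toy stand-in for "normal text"; a real model needs far more clean text.
CLEAN_TEXT = "cost of sales. research and development costs. "


def train_char_model(text, order=ORDER):
    """Count which character follows each context of `order` characters."""
    counts = defaultdict(Counter)
    padded = " " * order + text.lower()
    for i in range(order, len(padded)):
        counts[padded[i - order:i]][padded[i]] += 1
    return counts


def space_probability(counts, context, order=ORDER):
    """Estimate P(next char is a space | last `order` characters)."""
    context = context[-order:].rjust(order)
    seen = counts.get(context)
    if not seen:
        return 0.5  # unseen context: no evidence either way, so keep the space
    return seen[" "] / sum(seen.values())


def remove_unlikely_spaces(text, counts, threshold=0.1, order=ORDER):
    """Drop each space whose probability, given the cleaned text so far, is low."""
    out = []
    for ch in text:
        if ch == " ":
            context = "".join(out).lower()
            if space_probability(counts, context, order) < threshold:
                continue  # the model says a space is unlikely here
        out.append(ch)
    return "".join(out)


model = train_char_model(CLEAN_TEXT)
print(remove_unlikely_spaces("Cos t of s a l es", model))
print(remove_unlikely_spaces("Res ea rch a nd devel opment cos ts", model))
```

The toy model only works here because the training text contains the target phrases. With realistic data you would want much more training text, smoothing for unseen contexts, and probably a model that also sees the characters to the right of each space.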

And yeah, sadly I think this is out of scope for this board. It may be that you find a use for Prodigy in the project to score the extractions or something, but it won't be the main component of the solution. I don't think text classification is necessarily right either. Most of the decision is about the unigram and bigram probabilities of the words. For instance, "cost" is much more likely than "cos t", while "but the" is much more likely than "butthe".
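
As a rough illustration of the word-probability argument, the sketch below uses unigram probabilities only (no bigrams) and picks the way of merging adjacent fragments that gives the most probable word sequence. The word counts are made up, and `log_prob`, `fix_spacing`, `max_merge` and the length-based penalty for unseen strings are all hypothetical choices, not anything from spaCy or Prodigy. It assumes the extraction only adds spurious spaces and never drops real ones.

```python
import math
from collections import Counter

# Made-up counts; in practice, count words in a large corpus of clean text.
WORD_COUNTS = Counter({
    "cost": 50, "costs": 40, "of": 300, "sales": 30,
    "research": 20, "and": 250, "development": 15,
    "but": 100, "the": 500, "a": 200, "s": 1, "t": 1,
})
TOTAL = sum(WORD_COUNTS.values())


def log_prob(word):
    """Unigram log-probability, with a harsh penalty for unseen strings."""
    count = WORD_COUNTS.get(word.lower(), 0)
    if count:
        return math.log(count / TOTAL)
    # Assumption: an unseen string gets less likely the longer it is.
    return math.log(1 / (TOTAL * 10 ** len(word)))


def fix_spacing(text, max_merge=8):
    """Merge adjacent fragments so the summed log-probability is highest."""
    fragments = text.split()
    n = len(fragments)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for fragments[:i]
    back = [0] * (n + 1)            # back[i]: where the last word starts
    for end in range(1, n + 1):
        for start in range(max(0, end - max_merge), end):
            candidate = "".join(fragments[start:end])
            score = best[start] + log_prob(candidate)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end to recover the chosen words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append("".join(fragments[start:end]))
        end = start
    return " ".join(reversed(words))


print(fix_spacing("Cos t of s a l es"))
print(fix_spacing("Res ea rch a nd devel opment cos ts"))
print(fix_spacing("but the cost"))
```

Because "cost" has a real count while "cos" and "butthe" are unseen, the merge wins in the first case and the split is kept in the last one. Adding bigram probabilities would help with fragments that are valid words on their own, such as "a".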
