I have some digital PDFs from which I extract tables into pandas DataFrames. It works quite well, but sometimes there are whitespaces that shouldn't be there, e.g.
Cos t of s a l es
Res ea rch a nd devel opment cos ts
Since I'm able to decipher where the whitespaces should be, maybe I can create some spaCy component that is able to as well, but I'm not sure how. I could make it a textcat task for each possible merge of tokens, but it really depends on the whole phrase, not just the merged tokens. I'm not sure if this forum is the right place to ask, though. Could it be that Stack Overflow is a better choice?
I think a character-based language model to predict the spacing would probably work pretty well. You would train the model on normal text, then remove the spaces in your documents to which the language model assigns a low probability.
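To make the idea concrete, here is a minimal sketch using a tiny count-based character n-gram model in place of a real language model. The names `CharNgramLM` and `remove_unlikely_spaces`, the add-k smoothing, the 0.2 threshold, and the toy training text are all assumptions for illustration, not anything from spaCy or Prodigy. It only looks at the characters to the left of each space, so it will make mistakes that a model looking at both sides would avoid.

```python
from collections import Counter, defaultdict


class CharNgramLM:
    """Count-based character n-gram model with add-k smoothing (illustrative only)."""

    def __init__(self, order=3, k=0.1):
        self.order = order          # n-gram size; the context is order - 1 characters
        self.k = k                  # smoothing constant (assumed value)
        self.counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, text):
        n = self.order - 1
        padded = "^" * n + text.lower()   # "^" pads the start of the text
        self.vocab.update(padded)
        for i in range(n, len(padded)):
            self.counts[padded[i - n:i]][padded[i]] += 1

    def prob(self, context, char):
        # P(char | context) with add-k smoothing over the seen character set.
        seen = self.counts.get(context, Counter())
        total = sum(seen.values())
        return (seen[char] + self.k) / (total + self.k * len(self.vocab))


def remove_unlikely_spaces(text, lm, threshold=0.2):
    """Drop every space whose probability, given the left context, is below threshold."""
    n = lm.order - 1
    out = []
    for ch in text:
        if ch == " ":
            history = "^" * n + "".join(out).lower()
            context = history[len(history) - n:]
            if lm.prob(context, " ") < threshold:
                continue  # the model finds a space unlikely here, so skip it
        out.append(ch)
    return "".join(out)


# Toy "normal text"; in practice you would train on a large clean corpus.
lm = CharNgramLM(order=3)
lm.train("cost of sales. research and development costs. "
         "the cost of sales and the cost of research.")
print(remove_unlikely_spaces("Cos t of s a l es", lm))  # Cost of sales
```

With such a small training text the threshold is fragile, and a left-context-only model will keep some bad spaces (for example after "cost" when the next fragment is "s"). A stronger model and a comparison of the text with and without each space would be the natural next step.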
And yeah, sadly I think this is out of scope for this board. It may be that you find a use for Prodigy in the project to score the extractions or something, but it won't be the main component of the solution. I don't think text classification is necessarily right either. Most of the decision is about the unigram and bigram probabilities of the words. For instance,
"cost" is much more likely than "cos t", while "but the" is much more likely than
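As a rough illustration of the word-probability point, here is a sketch that uses unigram counts only (no bigrams) and picks the way of rejoining fragments that gives the most probable word sequence. The function names, the length-based penalty for unseen words, the `max_merge` limit, and the toy word list are all assumptions made for the example.

```python
import math
from collections import Counter


def train_unigrams(text):
    """Count lowercase whitespace-separated words in clean text."""
    return Counter(text.lower().split())


def log_prob(word, counts, total):
    """Log unigram probability; unseen words get a penalty that grows with length."""
    count = counts.get(word.lower(), 0)
    if count:
        return math.log(count / total)
    return math.log(1.0 / (total * 10 ** len(word)))  # assumed penalty


def rejoin_fragments(fragments, counts, max_merge=6):
    """Choose which spaces to drop so the resulting words are most probable."""
    total = sum(counts.values())
    n = len(fragments)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for fragments[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_merge), i):
            candidate = "".join(fragments[j:i])
            score = best[j] + log_prob(candidate, counts, total)
            if score > best[i]:
                best[i], back[i] = score, j
    words = []
    i = n
    while i > 0:                     # walk the backpointers to recover the words
        j = back[i]
        words.append("".join(fragments[j:i]))
        i = j
    return " ".join(reversed(words))


# Toy word counts; a real list would come from a large clean corpus.
counts = train_unigrams(
    "cost of sales research and development costs the cost of sales"
)
print(rejoin_fragments("Cos t of s a l es".split(), counts))
print(rejoin_fragments("Res ea rch a nd devel opment cos ts".split(), counts))
```

Because "cost" is a known word while "cos" and "t" are not, merging wins there; "ofsales" is unknown, so the space between "of" and "sales" survives. Adding bigram probabilities would help with cases where both the merged and the split forms are real words.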