Tables and Text

I have a body of documents in which some data is contained in tables and some in text. What facility can I use in spaCy, and consequently in Prodigy, that will at least identify that I am looking at a table and, if possible, tag a row or column as a named entity? For example, what type of pattern would I create?

We don’t have a function for table identification in spaCy. It’s definitely a useful process, and we’d like to have a recommended solution – so if you find a library that works particularly well for it, I hope you’ll share your experiences.

There’s a couple of reasons why I think it’s best not to provide this within spaCy itself. In order to keep the library to the right size, we want to only include functions which rely on spaCy internals or which rely on the core spaCy data structures. This mostly means processes which should be applied after tokenization, using the spaCy Doc object. Table identification is something you’d want to apply before spaCy, over the raw text. It’s similar to language identification in this respect, which we also don’t provide. I think the textacy library would be a good place to put the function, as its mission is to collect these useful processes that work “before and after” spaCy itself.
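To illustrate what a step "before spaCy, over the raw text" could look like, here is a minimal sketch of a heuristic pre-filter. It is not a spaCy or textacy function: the names `looks_like_table_row` and `split_tables_and_text` are hypothetical, and the rule (a line counts as a table row if tabs, pipes or runs of two or more spaces split it into several cells) is an assumption that will only suit plain-text exports with visible column gaps.

```python
import re

# Assumed cell separators: a tab, a run of 2+ whitespace characters, or a pipe.
_CELL_SPLIT = re.compile(r"\t|\s{2,}|\s*\|\s*")


def looks_like_table_row(line, min_cells=3):
    """Return True if the line splits into at least `min_cells` non-empty cells."""
    cells = [c for c in _CELL_SPLIT.split(line.strip()) if c]
    return len(cells) >= min_cells


def split_tables_and_text(raw_text, min_rows=2):
    """Group consecutive lines of the same kind into ("table" | "text", block) pairs.

    Blank lines end the current block. A run of table-like lines shorter
    than `min_rows` is labelled as text.
    """
    blocks = []  # each entry is [kind, list_of_lines]
    prev_blank = True
    for line in raw_text.splitlines():
        if not line.strip():
            prev_blank = True
            continue
        kind = "table" if looks_like_table_row(line) else "text"
        if blocks and not prev_blank and blocks[-1][0] == kind:
            blocks[-1][1].append(line)
        else:
            blocks.append([kind, [line]])
        prev_blank = False
    return [
        ("table" if kind == "table" and len(lines) >= min_rows else "text",
         "\n".join(lines))
        for kind, lines in blocks
    ]
```

With something like this, only the blocks labelled "text" would be passed on to the spaCy pipeline, and the "table" blocks could be routed to whatever table handling you settle on.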

If you want to train a model for table identification, there’s a couple of ways you might use Prodigy to help you. One interesting idea would be to phrase the annotation as a free-form selection, and use the image annotation tools. I think this would be a pretty fast way to do it, but you’d have to do a little extra work to resolve the coordinates you annotate back to the document, which presumably would be in PDF or Word format.
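The "extra work" of resolving coordinates could look roughly like the sketch below for the PDF case. It assumes each page was rendered to an image at a known DPI, that the annotated shape is available as a list of `[x, y]` pixel points with the origin at the top left, and that you want a bounding box in PDF points with the origin at the bottom left. The function name and parameters are hypothetical, and the exact shape of the annotation data should be checked against what your Prodigy version exports.

```python
def pixel_box_to_pdf_points(points, dpi, page_height_pt):
    """Convert an annotated polygon in image pixels to a PDF bounding box.

    points:         list of [x, y] pixel coordinates, origin at the top left
    dpi:            resolution the page was rendered at (assumed known)
    page_height_pt: page height in PDF points (1 point = 1/72 inch)

    Returns (x0, y0, x1, y1) in PDF points, origin at the bottom left.
    """
    scale = 72.0 / dpi  # pixels -> points
    xs = [x * scale for x, _ in points]
    # Flip the vertical axis: image y grows downward, PDF y grows upward.
    ys = [page_height_pt - y * scale for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

The resulting box can then be used to select the words on that page whose positions fall inside it, using whichever PDF library you already rely on for text extraction.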

Is there a best practice for how table data should be formatted during pre-processing? By default, the cells of my tables are getting separated into individual paragraphs, so I need an end goal for the formatting to be used by Prodigy. Ideally, each table would be combined into a single JSONL record that can be interpreted by the user.

I mention this because it might be of interest: I just released a video on YouTube where I explain how to build a custom recipe for data deduplication in Prodigy. Part of the video explains how a .jsonl structure is used to generate the HTML in Prodigy.

In case it's of interest, it's viewable here.
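As a rough sketch of the "one table per JSONL record" idea raised above, the snippet below turns a table (a list of rows) into a single task dictionary whose `"html"` value is a rendered `<table>`, which is the kind of field an HTML-based Prodigy interface can display. The helper names `table_to_task` and `write_jsonl` are hypothetical, and the extra `"rows"` key is an assumption added here so the raw cells travel with the task. It is not a format Prodigy requires.

```python
import json
from html import escape


def table_to_task(rows, header=True, meta=None):
    """Build one task dict for a whole table.

    rows:   list of rows, each a list of cell values
    header: treat the first row as column headers
    meta:   optional dict, e.g. source file and page number
    """
    parts = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if header and i == 0 else "td"
        # Escape cell text so characters like "<" cannot break the markup.
        cells = "".join(f"<{tag}>{escape(str(c))}</{tag}>" for c in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table>")
    return {"html": "".join(parts), "rows": rows, "meta": meta or {}}


def write_jsonl(tasks, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
```

For example, `write_jsonl([table_to_task([["Year", "Revenue"], ["2020", "12"]])], "tables.jsonl")` produces a file with one line per table, so the cells stay together instead of being split into separate paragraphs.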