Tables and Text

rkeyvani · July 22, 2018, 1:07pm

I have a body of documents in which some data is contained in tables and some in running text. What facility can I use in spaCy, and consequently Prodigy, that will at least identify that I am looking at a table and, if possible, tag a row/column named entity? For example, what type of pattern would I create?

honnibal · July 23, 2018, 11:39am

We don’t have a function for table identification in spaCy. It’s definitely a useful process, and we’d like to have a recommended solution – so if you find a library that works particularly well for it, I hope you’ll share your experiences.

There’s a couple of reasons why I think it’s best not to provide this within spaCy itself. In order to keep the library to the right size, we want to only include functions which rely on spaCy internals or which rely on the core spaCy data structures. This mostly means processes which should be applied after tokenization, using the spaCy Doc object. Table identification is something you’d want to apply before spaCy, over the raw text. It’s similar to language identification in this respect, which we also don’t provide. I think the textacy library would be a good place to put the function, as its mission is to collect these useful processes that work “before and after” spaCy itself.
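To illustrate what a step "before spaCy, over the raw text" could look like, here is a minimal sketch of a rule-based pre-filter. It is not part of spaCy or textacy. The function names, the separator pattern, and the `min_columns` / `min_rows` thresholds are all assumptions made for illustration, and a heuristic like this will miss many real-world table layouts.

```python
import re

# Hypothetical heuristic: a line "looks tabular" if it splits into several
# cells on tabs, pipes, or runs of two or more whitespace characters.
CELL_SEPARATOR = re.compile(r"\t|\s{2,}|\s*\|\s*")


def split_cells(line):
    """Split one raw line into non-empty cell strings."""
    return [cell for cell in CELL_SEPARATOR.split(line.strip()) if cell]


def find_table_blocks(text, min_columns=3, min_rows=2):
    """Return (start, end) line index pairs (end exclusive) for runs of
    consecutive lines that each have at least `min_columns` cells."""
    lines = text.splitlines()
    blocks = []
    start = None
    for i, line in enumerate(lines):
        is_row = len(split_cells(line)) >= min_columns
        if is_row and start is None:
            start = i  # a candidate table begins here
        elif not is_row and start is not None:
            if i - start >= min_rows:
                blocks.append((start, i))
            start = None
    # Close a run that extends to the end of the text.
    if start is not None and len(lines) - start >= min_rows:
        blocks.append((start, len(lines)))
    return blocks
```

The line ranges it returns could be used to route table-like blocks to a separate process and pass only the remaining prose to spaCy.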

If you want to train a model for table identification, there’s a couple of ways you might use Prodigy to help you. One interesting idea would be to phrase the annotation as a free-form selection, and use the image annotation tools. I think this would be a pretty fast way to do it, but you’d have to do a little extra work to resolve the coordinates you annotate back to the document, which presumably would be in PDF or Word format.
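The "extra work" of resolving coordinates might look roughly like the sketch below. It assumes each page was rendered to an image at a known DPI, and that an annotated region is a dict with a `"points"` list of `[x, y]` pixel pairs. That shape, the function name, and the parameters are assumptions for illustration, so check them against what your annotation tool actually exports. It also assumes the PDF page uses the default user space (72 points per inch, origin at the bottom left).

```python
def span_to_pdf_bbox(span, dpi, page_height_pt):
    """Convert a pixel-space polygon into a PDF-space bounding box.

    Returns (x0, y0, x1, y1) in points, with the origin at the bottom left.
    `span["points"]` is assumed to be a list of [x, y] pixel coordinates
    measured from the top-left corner of the rendered page image.
    """
    scale = 72.0 / dpi  # default PDF user space has 72 points per inch
    xs = [x * scale for x, _ in span["points"]]
    ys = [y * scale for _, y in span["points"]]
    # Image y grows downwards, PDF y grows upwards, so flip the y axis.
    return (min(xs), page_height_pt - max(ys), max(xs), page_height_pt - min(ys))


# Example: a rectangle drawn on a US Letter page (792 pt tall) rendered at 150 DPI.
region = {"label": "TABLE", "points": [[150, 300], [450, 300], [450, 600], [150, 600]]}
bbox = span_to_pdf_bbox(region, dpi=150, page_height_pt=792)
```

The resulting box can then be handed to whichever PDF library you use to pull out the text inside that region. Word documents have no fixed page geometry, so they would need to be converted to PDF or images first.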

zst · April 11, 2022, 12:56am

Is there a best practice for how table data should be formatted during pre-processing? By default, the cells of my tables are getting separated into individual paragraphs, so I need an end goal for the formatting to be used by Prodigy. Ideally the table would be combined into a single JSONL entry that can be interpreted by the user.

koaning · April 12, 2022, 8:28pm

I'm mentioning it because it might be of interest, but I just released a video on YouTube where I explain how to build a custom recipe for data deduplication in Prodigy. Part of the video explains how a .jsonl structure is used to generate the HTML in Prodigy.

In case it's of interest, it's viewable here.
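As a rough sketch of how one table could be kept together as a single JSONL line, the snippet below joins the cells back into rows and stores an HTML rendering under an `"html"` key, alongside the raw rows. This assumes the tasks are shown in an HTML-based interface that renders that key. The helper names and the `"rows"` field are made up for illustration, and it is not the recipe from the video.

```python
import html
import json


def table_to_task(rows, meta=None):
    """Build one task dict whose "html" field renders the whole table.

    `rows` is a list of lists of strings and the first row is treated as
    the header. Cell text is escaped so it cannot break the markup.
    """
    header, *body = rows
    head_html = "".join(f"<th>{html.escape(cell)}</th>" for cell in header)
    body_html = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in body
    )
    table_html = (
        f"<table><thead><tr>{head_html}</tr></thead>"
        f"<tbody>{body_html}</tbody></table>"
    )
    # Keep the raw cells too, so later steps do not have to parse the HTML.
    return {"html": table_html, "rows": rows, "meta": meta or {}}


def to_jsonl(tasks):
    """Serialise tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


task = table_to_task([["Name", "Qty"], ["Apple & Pear", "3"]], meta={"page": 1})
jsonl_line = to_jsonl([task])
```

Keeping the raw `rows` next to the rendered HTML means the annotator sees a readable table while downstream code still has access to the individual cells.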
