Legal Documents - Process to read raw PDF and extract paragraphs into jsonl format

Background: I am working with a large number of legal documents from which I want to extract information using NER, NER relationships, text classification and document classification. Essentially, I want to do something similar to what johnsnowlabs have done.

I understand this may require either fine-tuning or building a model from scratch. But before I get there, my first hurdle is to create a dataset from these documents that I can annotate. While my documents may contain tables or images, I am currently only interested in the text.

Questions:

  • What is the current best Python library to extract this data? pymupdf?
  • What are the best practices for this data extraction?

This is the start of a long project, so I appreciate help from anyone.

Hi and welcome! :wave: This is great timing, because we actually just released spacy-layout and corresponding Prodigy recipes that integrate with the new Docling library and models for processing PDFs and similar documents, layout analysis and more.

I also recently published a blog post that goes into a bit more detail, with examples, best practices and annotation workflows:

The Prodigy workflows include image-based and text-based annotation modes and are available on GitHub, so you can check out the source and adapt them to build something fully custom if needed: GitHub - explosion/prodigy-pdf: A Prodigy plugin for PDF annotation

The best strategy will of course depend on your documents and their structure, but Prodigy should hopefully make it easy to try things out and iterate. For example, for some document types it can be helpful to annotate with a preview of the layout, or even start with an image-based approach, while in other cases where plain text is the focus, it's best to abstract away the layout as early as possible and break the task down into smaller pieces you can work on independently.

Definitely let us know how you go – the workflows are still quite new and document processing tasks can be complex, so we're always interested in finding out which approaches work best for real-world use cases and which best practices to recommend :slightly_smiling_face:

Hi @ines
thank you for your quick response

I should have mentioned, I previously looked into (and tested) docling, spacy-layout and the prodigy-pdf plugin.

docling / spacy-layout: provided great extraction, capturing the layout information, but I was unsure how to convert this into the required annotation format.

  • Pro: Easy to use and good extraction.
  • Cons: Unsure how to convert this into a JSONL annotation dataset.

prodigy-pdf: using pdf.layout.fetch allowed me to ingest the information into Prodigy and do some annotation, but even with --split-pages enabled, I felt the content length was too big, as I ultimately want to fine-tune a Hugging Face transformer (legal-bert), which I believe has a token limit of 512 (a rough length-check sketch follows the list below).

  • Pro: Works well
  • Cons: Content length seems too long for end use (legal-bert)
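For context on the 512-token concern, here is a rough sketch of how such a length check could look. It's only illustrative: the model name nlpaueb/legal-bert-base-uncased is an assumed example, not a confirmed choice.

from transformers import AutoTokenizer

# Assumed checkpoint; swap in whichever legal-bert variant is actually used
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")

def fits_in_window(text: str, max_tokens: int = 512) -> bool:
    # add_special_tokens=True accounts for the [CLS] and [SEP] tokens
    return len(tokenizer.encode(text, add_special_tokens=True)) <= max_tokens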

Pymupdf: I recently wrote a custom script using pymupdf to extract the data and, based on the blocks, construct "paragraphs" along with their span information (using spaCy). This data is ultimately saved as a JSONL file which can be used in Prodigy (a simplified sketch of this script follows the list below).

  • Pro: It works, but seems messy.
  • Cons: It retains header / footer information (which docling could identify and filter out), which is not vital and adds noise to the annotations.
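For reference, a rough, simplified sketch of that script (the real version also builds the spaCy span information; the file names below are just placeholders):

import json
import fitz  # PyMuPDF

def pdf_blocks_to_jsonl(pdf_path, out_path):
    doc = fitz.open(pdf_path)
    with open(out_path, "w", encoding="utf8") as out:
        for page_no, page in enumerate(doc):
            # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type)
            for x0, y0, x1, y1, text, block_no, block_type in page.get_text("blocks"):
                if block_type != 0:  # 0 = text block, 1 = image block
                    continue
                text = " ".join(text.split())
                if not text:
                    continue
                task = {"text": text, "meta": {"page": page_no, "block": block_no}}
                out.write(json.dumps(task) + "\n")

# pdf_blocks_to_jsonl("lease.pdf", "lease_paragraphs.jsonl")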

I feel I may be overthinking this and that the prodigy-pdf plugin should be the way to go. Please let me know your thoughts.

For some additional context, the legal documents I am working with are lease agreements, so they mainly contain text and tables and are all very well structured (headings, sections).

As mentioned in my initial comment, the end goal is to set up a document processing pipeline:

Ingest Document -> Document Classification -> Clause Classification -> NER -> NER Relationship

Thanks again

Hi @ines ,

After some further research and testing, I was able to progress with data extraction using prodigy-pdf. I re-ran pdf.layout.fetch to convert a test PDF, specifically focusing on the text, list_item and section_header layout items, and saved this to a dataset.

I then attempted to run this through textcat.manual but ran into another issue: the UI was only displaying the extracted image and not the associated text to classify. After some more reading, I was able to create my own custom recipe, similar to pdf.spans.manual, that provides a block view with the choice on the left and the image (PDF page) with bounding box on the right.

The next issue I am facing is that this recipe is not allowing me to do multilabel classification on the texts even though I have "choice_style": "multiple" listed within the config. Below I have provided my custom recipe.

When running this, I am only able to select one label per annotation.
Please advise.

import prodigy
from prodigy.components.db import connect
from prodigy.util import log


@prodigy.recipe(
    "textcat.contract.manual",
    out_dataset=("Dataset to save annotations into", "positional", None, str),
    in_dataset=("Dataset to load annotations from", "positional", None, str),
)
def custom_recipe(out_dataset: str, in_dataset: str):

    # Log recipe details
    log("RECIPE: Starting recipe textcat.contract.manual", locals())

    # Connect to Prodigy database
    db = connect()

    # Define labels for text categorization
    textcat_labels = ["ACOPERATION", "DEFINITION","DEFAULT","GOVLAW","INSURANCE",
                      "LEASETERM", "MAINTENANCE","MODS","MISC","PAYMENTS","REDELIVERY",
                      "SCHEDULES","TERMINATION","WARRANTY"]

    # Helper functions for adding user provided labels to annotation tasks.
    def add_label_options_to_stream(stream, labels):
        options = [{"id": label, "text": label} for label in labels]
        for task in stream:
            task["options"] = options
            yield task
        
    # Function to call when annotations are returned
    def update(examples):
        print(f"Received {len(examples)} annotations!")

    stream = db.get_dataset_examples(in_dataset)
    stream = add_label_options_to_stream(stream, textcat_labels)
    
    blocks = [
        {"view_id": "choice", "text": None},
        {"view_id": "image"},
    ]

    # CSS, CSS_PREVIEW and FONT_SIZE_TEXT are custom style constants (definitions omitted here)
    css = CSS
    css += CSS_PREVIEW

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": out_dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "update": update,  # Function to call when annotations are returned
        "config": {  # Additional config settings, mostly for app UI
            "labels": textcat_labels,
            "global_css": css,
            "blocks": blocks,
            "choice_style": "multiple",
            "shade_bounding_boxes": True,
            "custom_theme": {
                "cardMaxWidth": "95%",
                "smallText": FONT_SIZE_TEXT,
                "tokenHeight": 25,
            },
        },
    }

Hi @dmnxprss,

First of all, thanks a lot for sharing such extensive feedback on your experience with the new prodigy-pdf recipes - this will definitely help us make the solution more flexible.

Before getting on to your custom recipe, I just wanted to mention that the easiest way to use the pdf.layout.fetch output with textcat.manual would be to remove the image key from the input stream.
I realize this might be a bit confusing: in the end, textcat.manual should only be looking for the text input key, but since it uses the choice UI (which is also used for classifying images), it checks for the image field first. We are actually considering improving the configuration to make the choice of input key more explicit.
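For example, a minimal sketch of what that could look like (the dataset and file names are just placeholders): export the fetched examples with the image key removed to a JSONL file that textcat.manual can then load.

import srsly
from prodigy.components.db import connect

db = connect()

def strip_images(examples):
    for eg in examples:
        eg.pop("image", None)  # without "image", the choice UI falls back to the text
        yield eg

examples = db.get_dataset_examples("pdf_layout_dataset")  # placeholder dataset name
srsly.write_jsonl("layout_text_only.jsonl", strip_images(examples))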

With respect to your custom recipe, the actual problem is that the choice UI is not being rendered at all. The labels you see come from the spans_manual UI, which is part of the blocks defined in the input stream (i.e. the output of the pdf.layout.fetch recipe). Because these incoming blocks are defined on the task level, they take priority over the blocks defined globally in your recipe. In other words, the blocks you define in the recipe:

blocks = [
        {"view_id": "choice", "text": None},
        {"view_id": "image"},
    ]

are being ignored because each task comes with blocks defined on the task level and these are:

[
    {'view_id': 'spans_manual'},
    {'view_id': 'image', 'spans': [{...}]}
]

So in order to render blocks with the span-annotated image, the text (from the bounding box) and the choice, you'd need to either modify the blocks on the task level, or remove the blocks that come with the input stream and define yours globally. You probably also want to get rid of the spans_manual UI altogether.
If you go with the first option (modifying blocks on the task level), you could modify your option-adding function like so:

def add_label_options_to_stream(stream, labels):
    options = [{"id": label, "text": label} for label in labels]
    for task in stream:
        task["options"] = options
        task["config"]["choice_style"] = "multiple"
        
        # Filter out spans_manual blocks and add new view blocks
        blocks = [block for block in task["config"]["blocks"] 
                 if block["view_id"] != "spans_manual"]
        blocks.extend([
            {"view_id": "text"},
            {"view_id": "choice", "text": None, "image": None}
        ])
        
        task["config"]["blocks"] = blocks
        yield task

You should now be rendering only the image, text and choice UI.
Please note that you'll need to adjust the CSS of your custom recipe to render all the elements in the right columns.

Here's how the recipe could be updated, but you might want to tweak the CSS to your preference:

import prodigy
from prodigy.components.db import connect
from prodigy.util import log

# Selectors for each component
CSS_IMAGE = "div.prodigy-content:nth-child(2)"
CSS_TEXT = "div.prodigy-content:nth-child(3)"
CSS_CHOICE = "._Choice-root-0-1-196"

# Container setup
CSS_CONTAINER = """
.prodigy-container {
    display: grid;
    grid-template-columns: 1fr 50%;
    height: 100vh;
    overflow-y: auto;
    overflow-x: hidden;
}
"""

# Layout CSS
CSS_PREVIEW = f"""
{CSS_CONTAINER}

/* Left column content - setting up the flex container */
{CSS_IMAGE} {{
    grid-column: 1;
    grid-row: 1;
    border-right: 1px solid #ddd;
    height: 100%;
}}

{CSS_TEXT} {{
    grid-column: 1;
    grid-row: 2;
    border-right: 1px solid #ddd;
    height: 100%;
    top: 0;
}}

/* Right column for choices */
{CSS_CHOICE} {{
    grid-column: 2;
    grid-row: 1;
}}
"""
FONT_SIZE_TEXT = 14

@prodigy.recipe(
    "textcat.contract.manual",
    out_dataset=("Dataset to save annotations into", "positional", None, str),
    in_dataset=("Dataset to load annotations from", "positional", None, str),
)
def custom_recipe(out_dataset: str, in_dataset: str):

    # Log recipe details
    log("RECIPE: Starting recipe textcat.contract.manual", locals())

    # Connect to Prodigy database
    db = connect()

    # Define labels for text categorization
    textcat_labels = ["ACOPERATION", "DEFINITION","DEFAULT","GOVLAW","INSURANCE",
                      "LEASETERM", "MAINTENANCE","MODS","MISC","PAYMENTS","REDELIVERY",
                      "SCHEDULES","TERMINATION","WARRANTY"]

    # Helper functions for adding user provided labels to annotation tasks.
    def add_label_options_to_stream(stream, labels):
        options = [{"id": label, "text": label} for label in labels]
        for task in stream:
            task["options"] = options
            task["config"]["choice_style"] = "multiple"
            
            # Filter out spans_manual blocks and add new view blocks
            blocks = [block for block in task["config"]["blocks"] 
                     if block["view_id"] != "spans_manual"]
            blocks.extend([
                {"view_id": "text"},
                {"view_id": "choice", "text": None, "image": None}
            ])
            
            task["config"]["blocks"] = blocks
            yield task
        
    # Function to call when annotations are returned
    def update(examples):
        print(f"Received {len(examples)} annotations!")

    stream = db.get_dataset_examples(in_dataset)
    stream = add_label_options_to_stream(stream, textcat_labels)

    css = CSS_PREVIEW

    return {
        "view_id": "blocks",  # Annotation interface to use
        "dataset": out_dataset,  # Name of dataset to save annotations
        "stream": stream,  # Incoming stream of examples
        "update": update,  # Function to call when annotations are returned
        "config": {  # Additional config settings, mostly for app UI
            "global_css": css,
            "shade_bounding_boxes": True,
            "custom_theme": {
                "cardMaxWidth": "95%",
                "smallText": FONT_SIZE_TEXT,
                "tokenHeight": 25,
            },
        },
    }

Hi @magdaaniol,

Apologies, I never received an alert about your response. Your code worked and was extremely helpful!

With help from you and @ines I have been able to:

  1. Convert existing PDFs to annotation datasets (custom prodigy-pdf recipe)
  2. Setup a custom text annotation process (with a side pdf panel for verification)

My next question is a general process question.

For cases like mine, where there are 10+ textcat labels, your notes (and other questions) mention tackling these on a category-by-category basis, as it's more efficient for the annotator.

Could you explain how I could achieve this in my setup?
Should I conduct multiple runs, one for each category?
After these runs, how would I merge them into a final dataset from which to test for accuracy?

Thanks

Hi @dmnxprss,

Happy to hear your setup works now!
As for the notifications about updates to the thread (sorry, I realize you've mentioned this before), right below the last post there's a grey alarm button that lets me set the notification level:


Then, under your Profile > Preferences > Emails, there are settings relevant to receiving email alerts.
I'm not sure how different my admin view is from the user view, but it seems like you should be able to set this from your profile preferences. Or perhaps the alerts end up in your spam folder?

On to the thing :slight_smile:
If we're talking about 10+ labels and the categories are fairly complex, then yes, it's probably better to frame the annotation as a series of binary yes/no questions about each category, so that the annotators don't have to keep all the criteria in memory.
In practice, that would mean performing multiple passes over the dataset, yes. However, it's usually faster and less error-prone, as the annotators are laser-focused on one particular category.
You would use the classification UI block rather than choice and provide just one label at a time.
Once you've collected your binary datasets, you should be able to use them directly with the train or data-to-spacy recipe. Ideally, you should be training on all binary datasets together. These Prodigy commands will take care of merging all annotations on the same input, so if an example contains binary annotations for several labels, the model will be updated with all of this information.
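To make that a bit more concrete, here's a minimal sketch (the function name, label and dataset names are purely illustrative) of how the stream helper from your recipe could be adapted for a single-label binary pass with the classification UI:

def add_binary_label_to_stream(stream, label):
    for task in stream:
        # The classification UI renders this label at the top for accept/reject
        task["label"] = label
        yield task

# e.g. one pass per category, each saved to its own dataset:
# stream = add_binary_label_to_stream(db.get_dataset_examples(in_dataset), "INSURANCE")
# and return {"view_id": "classification", ...} from the recipe instead of the blocks setup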
