Data prep

I've been trying to get started with Prodigy for a little while now. I've gone through the spaCy training course and a textbook on spaCy, but these don't seem to get into the topic of pre-processing data. I've searched through these forums and most of the relevant questions also gloss over the actual data processing step (e.g. Create a dataset out of many txt_files documents (Best Practice) - Prodigy Support).

My project involves about 2,000 documents that are between 20 and 80 pages. There are about 8 sections in each document. Due to the history of the work, I have most of them available in PDF, docx, html, and txt formats.

I need to categorise each of these at the document level for 4 exclusion/inclusion criteria. After that I need to do categorisation and classification for the included sample.

Is the best way forward to convert the txt files into jsonl or would it be better to retain the structure from html or docx? Either way, are you able to point me towards a reference or explain how I would do this for all the files in a specific directory?

I'm thinking that once the initial 4 criteria are labeled, I can split the documents up and start again. Does that make sense?

Hi! I think more generally, JSON or JSONL is a good format for storing annotations, because it allows attaching additional meta information (e.g. details on the structure of the document or which document a paragraph refers to) and it supports nested data structures. It also lets you include text with newlines `\n` in the individual examples, which is more difficult to handle if you only use plain text that you read in line by line.
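For example, a minimal JSONL corpus with per-example meta could look like this (the file name and meta fields here are just placeholders, not a required schema):

```python
import json

# One JSON object per line; "meta" can carry document structure info,
# e.g. which document and section a paragraph belongs to
examples = [
    {"text": "Section 1\nIntroduction ...", "meta": {"doc_id": "doc-001", "section": 1}},
    {"text": "Section 2\nMethods ...", "meta": {"doc_id": "doc-001", "section": 2}},
]

with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")

# Reading it back is just as simple: one json.loads per line
with open("corpus.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

Note how the embedded `\n` inside each `"text"` value survives the round trip, because it's escaped in the JSON string rather than acting as a record separator.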

So I think it makes sense to invest some time into standardising the data format and making sure all your texts are available in the same format that can be read in easily in Python and other processes. PDFs and Word documents are often a bad "source of truth" because their format can vary widely – PDFs can pretty much include anything and docx is a proprietary format. So standardising on a JSON-like format with the same consistent metadata can help a lot, especially if your goal is to train an NLP pipeline.

How you do this depends on the data you have – you probably want different scripts for the different formats: one for extracting the texts from PDFs (using a library like pypdf or pymupdf), and another to load and parse Word docs. At the end of it, each script would then export a JSON(L) file with a key {"text": "..."} for the raw text and any additional meta info you want to include.
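A sketch of that pattern, assuming pypdf for the PDF side – the function names, directory layout, and meta fields are all my own placeholders, not anything Prodigy requires:

```python
import json
from pathlib import Path


def export_jsonl(records, out_path):
    """Write one JSON object per line: {"text": ..., "meta": ...}."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def pdf_records(directory):
    """Yield one record per PDF file in `directory`."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    for path in sorted(Path(directory).glob("*.pdf")):
        reader = PdfReader(path)
        # Join page texts; extract_text() can return None for empty pages
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        yield {"text": text, "meta": {"source": path.name}}


# A separate docx_records() / html_records() would follow the same shape,
# so every script ends up producing the same JSONL structure:
# export_jsonl(pdf_records("docs/"), "pdf_corpus.jsonl")
```

The key design point is that only the extraction function differs per format – everything downstream (Prodigy, training scripts) just sees a uniform stream of `{"text": ...}` records.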

To follow up on what I did: I converted from docx to HTML by driving Word via PowerShell (format code 10 is wdFormatFilteredHTML):

    import subprocess

    command = f"""$word_app = New-Object -ComObject Word.Application
    $document = $word_app.Documents.Open("{pathname}{filenameonly}.docx")
    $document.SaveAs([ref] "{pathname}{filenameonly}.htm", [ref]10)
    $document.Close()
    $word_app.Quit()"""

    p = subprocess.Popen(["powershell", "& {" + command + "}"])
    p.wait()

which preserves numbered lists (e.g. 7.1). I then used beautifulsoup4 to remove tables and otherwise processed the data before appending it to a JSONL file.
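A minimal sketch of that BeautifulSoup step, assuming the filtered HTML produced by the Word conversion – the function name, sample HTML, output file name, and meta fields are illustrative only:

```python
import json
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def html_to_record(html, source):
    """Strip tables from the HTML and return a JSONL-ready record."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove table markup entirely before extracting text
    for table in soup.find_all("table"):
        table.decompose()
    text = soup.get_text(separator="\n", strip=True)
    return {"text": text, "meta": {"source": source}}


# Append one record per converted document to a shared JSONL file
record = html_to_record(
    "<h1>7.1 Scope</h1><table><tr><td>skip</td></tr></table><p>Body text.</p>",
    "doc1.htm",
)
with open("texts.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```

Opening the JSONL file in append mode (`"a"`) lets you process the directory one document at a time without holding everything in memory.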