Dataset Creation Newbie

Hello fellow AI enthusiasts. I picked up prodigy to begin working on cleaning up datasets for use in some personal home projects. I have been able to annotate and create some fantastic datasets that have already been json preformatted. I was wondering if prodigy can be used to create a clean dataset for LLM chat training from raw text?

The data I have will be very useful for what I'd like to do, but it's buried in things such as HTML headers, or email header data, etc. I know I could whip up a script to clean that extraneous data off, but if prodigy can already do it, why reinvent the wheel? :)

Hi there!

Having an interface for LLMs is something that I am internally investigating a bit, but it's hard to find something general that will work for everyone. My main advice for now is to try and build something with custom HTML templates, mainly because that's the most flexible option.

If you have feedback/ideas feel free to post them here though. I'll gladly think along and I'm also eager to understand your use-case. Do you have a sketch of an interface that you had in mind?

Hi there! Thank you so much for the response.

This isn't so much an interface, more along the lines of using prodigy to clean a dataset that I will be using to finetune a language model (more for chat than data analysis).

The dataset I have is a series of hand crafted notes and stories involving a world that was built by a client; however, due to the nature of the multitude of tools and applications this data was pulled from, it comes with all sorts of extra garbage data, such as email headers, old MS office doc formats, plain text, and some that were put together in some HTML editor that left HTML artifacts all over it.

I was wondering if prodigy would have the ability to help me (through classification) tag garbage data (or the good data) and help me prune it from the many documents so I can convert it into a larger JSON file to use in a Llama 2 finetune experiment.

Long story short, I was hired by a writer to turn the notes and stories of his world into a training dataset and tune it into a Llama 2 model for him to interact with. I am hoping prodigy can be used to help me clean up the documents, I am just not sure how to begin a process like that. I have mostly worked with straight JSON for NLP work, so this is my first time doing a dataset for pure human interaction with LLM inference.

That sounds like you might be interested in a classification task that's simply making the choice between "clean data" and "non clean data". That's something that textcat.manual could do out of the box.

The interface could then look something like this?

python -m prodigy textcat.manual xxx examples.jsonl --label clean

You could go a step further, because maybe it's worth having a distinction between HTML artifacts and MS Word artifacts.

python -m prodigy textcat.manual xxx examples.jsonl --label html,word,other

Might something like this make sense?
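In case it helps with getting started: the commands above expect an examples.jsonl file with one JSON object per line, each with a "text" key. Below is a minimal sketch of how raw text files could be turned into such a file. The folder name, the paragraph-level splitting and the "meta" contents are my own assumptions for illustration, not something Prodigy requires.

```python
import json
import re
from pathlib import Path


def split_paragraphs(raw: str) -> list[str]:
    """Split raw text on blank lines and drop empty chunks."""
    chunks = [chunk.strip() for chunk in re.split(r"\n\s*\n", raw)]
    return [chunk for chunk in chunks if chunk]


def build_tasks(folder: Path) -> list[dict]:
    """Turn every .txt file in `folder` into one task per paragraph."""
    tasks = []
    for path in sorted(folder.glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        for i, paragraph in enumerate(split_paragraphs(raw)):
            # "meta" is optional; here it records where the paragraph came from
            tasks.append({"text": paragraph, "meta": {"source": path.name, "paragraph": i}})
    return tasks


def write_jsonl(tasks: list[dict], out_path: Path) -> None:
    """Write one JSON object per line."""
    with out_path.open("w", encoding="utf-8") as f:
        for task in tasks:
            f.write(json.dumps(task, ensure_ascii=False) + "\n")


# Hypothetical usage, assuming the notes live in a folder called raw_notes:
# write_jsonl(build_tasks(Path("raw_notes")), Path("examples.jsonl"))
```

Splitting per paragraph keeps each annotation decision small, which tends to make the "clean or not" call easier than judging a whole document at once.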

Taking a step back

One thing about your task though. Do you really need to collect data in order to figure out that there are HTML artifacts in your data? There are tools like justext that do a pretty reasonable job at removing HTML, which would also remove the HTML artifacts.

import justext

text = "And then John the archer went into <span><b>the woords</i>"

# justext assumes properly formatted html, so wrap in <p> tags
wrapped = f"<p>{text}</p>"

# justext finds all nested paragraphs and iterates over them
paragraphs = justext.justext(wrapped, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)

This yields the following sentence:

And then John the archer went into the woords

I imagine there are similar cleaning approaches for Word documents, and odds are that it may be more pragmatic to use such a heuristic tool and to see how far you'll be able to stretch that.
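The same heuristic idea could apply to the email headers you mentioned. Here is a minimal sketch using only Python's standard library email module. It assumes the raw text is a well-formed email message (headers, a blank line, then the body); the function name and the sample message are made up for illustration.

```python
from email import message_from_string
from email.policy import default


def strip_email_headers(raw_email: str) -> str:
    """Return only the plain-text body of a raw email message."""
    message = message_from_string(raw_email, policy=default)
    # Prefer the text/plain part; returns None if the message has none
    body = message.get_body(preferencelist=("plain",))
    return body.get_content().strip() if body is not None else ""


sample = (
    "From: writer@example.com\n"
    "To: editor@example.com\n"
    "Subject: Notes on the northern kingdoms\n"
    "\n"
    "The archers of the north guard the old road.\n"
)
print(strip_email_headers(sample))
# prints: The archers of the north guard the old road.
```

Text that only contains fragments of headers pasted into a document would not parse this way and would probably need a different heuristic.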

Don't get me wrong, it can still be a good idea to use Prodigy to double-check if the heuristics work as expected. But I figured I should at least mention it in case you weren't aware. It's also possible that you already considered these tools but aren't using them for a reason that I'm unaware of. If that's the case: let me know! I'm still eager to help think along.
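If you do end up double-checking with Prodigy, the annotations can be exported with the db-out command (for example python -m prodigy db-out xxx > annotations.jsonl) and filtered afterwards. The sketch below assumes each exported line is a JSON object with a "text" field and an "answer" field set to "accept", "reject" or "ignore", which is how the binary accept/reject annotations are stored as far as I'm aware. The function name and the sample lines are hypothetical.

```python
import json


def accepted_texts(jsonl_lines):
    """Yield the text of every annotation whose answer is 'accept'."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("answer") == "accept":
            yield record["text"]


# Made-up sample of what exported lines might look like
export = [
    '{"text": "The archers guard the old road.", "label": "clean", "answer": "accept"}',
    '{"text": "Content-Type: text/html", "label": "clean", "answer": "reject"}',
]
print(list(accepted_texts(export)))
# prints: ['The archers guard the old road.']
```

How the accepted texts are then packed into a JSON file for the finetune depends on what your training tooling expects, so that part is left out here.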

Final thing

We have some tools that might help with prompt engineering. I understand this isn't what you're currently after but I figured mentioning it early might be good.