Wow, this is amazing, thanks so much for sharing your code! So you'd be okay with having this integrated into the core library? If so, we might have to ask you to fill in a simple contributor agreement.
Your code looks good and is actually very similar to the built-in loaders. The built-in loaders are all classes that are initialised with optional loader-specific settings and have an `__iter__` method that yields the tasks. If the corpus or API response exposes the total number of results upfront, I've also added a `__len__` method. (If a recipe doesn't resort the stream and doesn't expose a custom progress function, Prodigy checks whether the stream exposes a `__len__` attribute and uses that to calculate the progress. But of course, this only makes sense if the stream is finite and/or we know its length upfront.)
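For illustration, here's a minimal sketch of what such a loader class could look like. The class name and the `lowercase` setting are made up for the example; only the `__iter__`/`__len__` shape mirrors the built-ins described above:

```python
class MyCorpusLoader:
    """Hypothetical loader in the style of Prodigy's built-in loaders."""

    def __init__(self, records, lowercase=False):
        self.records = records      # e.g. rows loaded from a file or API
        self.lowercase = lowercase  # example of a loader-specific setting

    def __iter__(self):
        # Yield one annotation task per record
        for record in self.records:
            text = record.lower() if self.lowercase else record
            yield {"text": text}

    def __len__(self):
        # Only defined because the total is known upfront; Prodigy can use
        # this for progress if the recipe doesn't resort the stream
        return len(self.records)
```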
Not at the moment, but it'd definitely be nice to make this possible! I'm not sure what the best solution would be. We could try solving this via entry points, similar to what we're planning for custom recipes. We could also consider allowing the `--loader` argument to point to a file instead?
```
prodigy ner.teach dataset model posts.xml --loader loaders.py::StackExchange
```
In theory, this is possible, but it might be a little hacky. If there's an easy way to extract the LaTeX markup from the text, you could, for example, use a custom HTML template for this and add a `"html"` key to the task (instead of the `"text"` key). You could then use a library to convert the markup to an image, ideally an SVG, because you can simply inline this with the HTML. You can then replace the markup with the image and produce tasks that look like this:
```json
{"html": "Text text <svg>...</svg> text text"}
```
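Just to sketch the conversion step, it could look something like this. Here `latex_to_svg` is a placeholder for whatever converter you end up using, and the `$...$` pattern is just an assumption about how the markup is delimited:

```python
import re

def latex_to_svg(markup):
    # Placeholder - in practice you'd call out to a real LaTeX-to-SVG
    # converter here and return the SVG markup as a string
    return "<svg><!-- rendered: {} --></svg>".format(markup)

def make_html_task(text):
    # Replace each $...$ span with its rendered SVG and store the result
    # under "html" so Prodigy renders it as markup
    html = re.sub(r"\$(.+?)\$", lambda m: latex_to_svg(m.group(1)), text)
    return {"html": html, "text": text}  # keep the original string too
```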
If you can only convert it to a JPG or PNG, the whole thing is a little trickier, but not impossible. You could save out the files and include them as `<img>` tags, but this is pretty inconvenient, especially if you do that while processing the stream. So ideally, you'd also want to inline the image with the markup. For Prodigy's image recipes, I've written a few helper functions to convert images to base64-encoded data URIs (those are currently internals):
```python
from prodigy.util import img_to_b64_uri

image_bytes = CONVERT_YOUR_LATEX_MARKUP_TO_A_PNG()  # however you render the PNG
mimetype = "image/png"
data_uri = img_to_b64_uri(image_bytes, mimetype)
# this will produce a string like: data:image/png;base64,iVBORw0...
image_html = '<img src="{}" />'.format(data_uri)
```
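If you'd rather not depend on internals, the same thing is straightforward with the standard library. This sketch is equivalent in spirit to the helper above (the function name is mine, not Prodigy's):

```python
import base64

def img_to_data_uri(image_bytes, mimetype):
    # Encode the raw image bytes and prepend the data URI header,
    # producing a string that can be inlined in an <img src="..."> tag
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(mimetype, b64)
```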
Depending on the image size, this can easily add some undesired bloat to your datasets, though. In general, I'd always recommend keeping a `"text"` key containing the original string in addition to the `"html"` property. This way, you'll always be able to relate the generated HTML back to the original text. (It'll also make debugging easier if something goes wrong during conversion.)