Wow, this is amazing, thanks so much for sharing your code! So you'd be okay with having this integrated into the core library? If so, we might have to ask you to fill in a simple contributor agreement.
Your code looks good and is actually very similar to the built-in loaders. The built-in loaders are all classes that are initialised with optional loader-specific settings and have an `__iter__` method that yields the tasks. If the corpus or API response exposes the total number of results upfront, I've also added a `__len__` method. (If a recipe doesn't resort the stream and doesn't expose a custom progress function, Prodigy checks whether the stream exposes a `__len__` attribute and uses that to calculate the progress. But of course, this only makes sense if the stream is finite and/or we know its length upfront.)
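For illustration, here's a minimal sketch of what such a loader class could look like. The class name and the `lowercase` setting are made up for the example; only the `__iter__`/`__len__` shape mirrors the built-ins described above:

```python
class MyCorpusLoader:
    """Hypothetical loader in the style of Prodigy's built-in loaders."""

    def __init__(self, records, lowercase=False):
        self.records = records      # e.g. rows loaded from a file or API
        self.lowercase = lowercase  # example of a loader-specific setting

    def __iter__(self):
        # Yield one annotation task per record
        for record in self.records:
            text = record.lower() if self.lowercase else record
            yield {"text": text}

    def __len__(self):
        # Only defined because the total is known upfront; Prodigy can use
        # this for progress if the recipe doesn't resort the stream
        return len(self.records)
```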
Not at the moment, but it'd definitely be nice to make this possible! I'm not sure what the best solution would be. We could try solving this via entry points, similar to what we're planning for custom recipes. We could also consider allowing the `--loader` argument to point to a file instead?
```
prodigy ner.teach dataset model posts.xml --loader loaders.py::StackExchange
```
In theory, this is possible, but it might be a little hacky. If there's an easy way to extract the LaTeX markup from the text, you could, for example, use a custom HTML template for this and add a `"html"` key to the task (instead of the `"text"` key). You could then use a library to convert the markup to an image, ideally an SVG, because you can simply inline this with the HTML. You can then replace the markup with the image and produce tasks that look like this:
```json
{"html": "Text text <svg>...</svg> text text"}
```
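Just to sketch the conversion step, it could look something like this. Here `latex_to_svg` is a placeholder for whatever converter you end up using, and the `$...$` pattern is just an assumption about how the markup is delimited:

```python
import re

def latex_to_svg(markup):
    # Placeholder - in practice you'd call out to a real LaTeX-to-SVG
    # converter here and return the SVG markup as a string
    return "<svg><!-- rendered: {} --></svg>".format(markup)

def make_html_task(text):
    # Replace each $...$ span with its rendered SVG and store the result
    # under "html" so Prodigy renders it as markup
    html = re.sub(r"\$(.+?)\$", lambda m: latex_to_svg(m.group(1)), text)
    return {"html": html, "text": text}  # keep the original string too
```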
If you can only convert it to a JPG or PNG, the whole thing is a little trickier, but not impossible. You could save out the files and include them as `<img>` tags, but this is pretty inconvenient, especially if you do that while processing the stream. So ideally, you'd also want to inline the image with the markup. For Prodigy's image recipes, I've written a few helper functions to convert images to base64-encoded data URIs (those are currently internals):
```python
from prodigy.util import img_to_b64_uri

image_bytes = CONVERT_YOUR_LATEX_MARKUP_TO_A_PNG()  # however you render the PNG
mimetype = "image/png"
data_uri = img_to_b64_uri(image_bytes, mimetype)
# this will produce a string like: data:image/png;base64,iVBORw0...
image_html = '<img src="{}" />'.format(data_uri)
```
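If you'd rather not depend on internals, the same thing is straightforward with the standard library. This sketch is equivalent in spirit to the helper above (the function name is mine, not Prodigy's):

```python
import base64

def img_to_data_uri(image_bytes, mimetype):
    # Encode the raw image bytes and prepend the data URI header,
    # producing a string that can be inlined in an <img src="..."> tag
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(mimetype, b64)
```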
Depending on the image size, this can easily add some undesired bloat to your datasets, though. In general, I'd always recommend keeping a `"text"` key containing the original string in addition to the `"html"` property. This way, you'll always be able to relate the generated HTML back to the original text. (It'll also make debugging easier if something goes wrong during conversion.)