Okay cool – I'll get a contributor agreement ready. Something simple like the spaCy Contributor Agreement should probably do.
Yes, this makes a lot of sense and we definitely prefer the streaming approach. Don't bother with the `__len__` if it means actually counting the texts – it only makes sense if, say, the headers exposed a total count that was easy to extract upfront. For very large corpora like this, using the total length to calculate the progress isn't that useful anyway, since you likely won't be annotating the whole thing at once. It's much more important that Prodigy is able to present the first tasks for annotation as quickly as possible.
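If the headers ever do expose a total count, a small wrapper around the generator is enough to add a `__len__` – but that part is strictly optional. Here's a minimal sketch; the `fetch_pages` helper is just a placeholder assumption for however you page through the API, not something Prodigy provides:

def stream_texts(fetch_pages):
    # Yield a task as soon as each page of texts arrives, so annotation
    # can start before the whole corpus has been fetched
    for page in fetch_pages():  # assumed: generator yielding lists of texts
        for text in page:
            yield {'text': text}

class CountedStream:
    # Optional: only wrap the stream if the total is cheap to get,
    # e.g. from a response header – never count the texts yourself
    def __init__(self, stream, total):
        self.stream = stream
        self.total = total

    def __len__(self):
        return self.total

    def __iter__(self):
        return iter(self.stream)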
I just had another idea for the LaTeX / mathematical markup conversion: Instead of including this with the loader, you could also make it a preprocessor function that wraps a stream – like the built-in `split_sentences` or `fetch_images` (which converts all images, paths and URLs in the stream to data URIs). This would let you and others reuse the function with other loaders and streams from different sources.
def add_latex_html(stream):  # maybe needs a better name?
    for eg in stream:  # iterate over the examples
        # convert the example text to HTML and replace markup, if found
        html = CONVERT_TEXT_TO_HTML_WITH_IMAGES(eg['text'])
        eg['html'] = html  # add an 'html' key to the task
        yield eg
The preprocessor could then be used like this:
from prodigy.components.loaders import JSONL

stream = JSONL('/path/to/your_data.jsonl')  # or any other loader
stream = add_latex_html(stream)
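To give a rough idea of how this could slot into a custom recipe (the recipe name and arguments below are just placeholders – adapt them to your setup), you'd return the wrapped stream with the html interface so the converted markup is rendered in the web app:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('latex-html')
def latex_html(dataset, source):
    stream = JSONL(source)            # or any other loader / custom stream
    stream = add_latex_html(stream)   # add the 'html' key to each task
    return {
        'dataset': dataset,   # dataset to save annotations to
        'stream': stream,     # the wrapped generator
        'view_id': 'html'     # render the 'html' key in the web app
    }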
Yes, this sounds good! We're also working on getting a `prodigy-recipes` repository up (see here) – this should hopefully make it easier for users to contribute to the built-in recipes, and share their custom recipes and helper functions with the community.
Thanks again for your great work!