Template for Prodigy corpus and API loaders

Okay cool – I'll get a contributor agreement ready. Something simple like the spaCy Contributor Agreement should probably do.

Yes, this makes a lot of sense and we definitely prefer the streaming approach. Don't bother with the __len__ if it means actually counting the texts – it only makes sense if, say, the headers exposed a total count that was easy to extract upfront. For very large corpora like this, using the total length to calculate the progress isn't that useful anyway, since you likely won't be annotating the whole thing at once. It's much more important that Prodigy is able to present the first tasks for annotation as quickly as possible.
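
To illustrate the streaming idea, here's a minimal sketch of a generator-style loader for a paginated API – the URL, parameters and response shape are just made-up placeholders, not a real endpoint:

import requests

def api_loader(url):
    # hypothetical paginated API – fetch one page at a time so the first
    # tasks are available for annotation immediately, without counting or
    # downloading the whole corpus upfront
    page = 0
    while True:
        res = requests.get(url, params={'page': page}).json()
        if not res.get('results'):
            break
        for doc in res['results']:
            yield {'text': doc['text']}  # each task is a simple dict
        page += 1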

I just had another idea for the LaTeX / mathematical markup conversion: Instead of including this with the loader, you could also make it a preprocessor function that wraps a stream – like the built-in split_sentences or fetch_images (which converts all images, paths and URLs in the stream to data URIs). This would let you and others reuse the function with other loaders and streams from different sources.

def add_latex_html(stream):  # maybe needs a better name?
    for eg in stream:  # iterate over the examples in the stream
        # convert the example text to HTML and replace markup, if found
        # (CONVERT_TEXT_TO_HTML_WITH_IMAGES stands in for your conversion function)
        html = CONVERT_TEXT_TO_HTML_WITH_IMAGES(eg['text'])
        eg['html'] = html  # add an 'html' key to the task
        yield eg

The preprocessor could then be used like this:

from prodigy.components.loaders import JSONL

stream = JSONL('/path/to/your_data.jsonl')  # or any other loader
stream = add_latex_html(stream)
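
The same chaining pattern should also work together with the built-in preprocessors mentioned above – a rough example, assuming a loaded spaCy model (the model name here is just an example):

import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import split_sentences

nlp = spacy.load('en_core_web_sm')  # any model that can split sentences
stream = JSONL('/path/to/your_data.jsonl')
stream = split_sentences(nlp, stream)  # built-in preprocessor
stream = add_latex_html(stream)        # your custom preprocessor from above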

Yes, this sounds good! We're also working on getting a prodigy-recipes repository up (see here) – this should hopefully make it easier for users to contribute to the built-in recipes, and share their custom recipes and helper functions with the community.
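
Just to give a rough idea of how the loader and preprocessor could later plug into a custom recipe (the recipe name and arguments are placeholders, and argument annotations are left out for brevity):

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('latex-texts')
def latex_texts(dataset, source):
    stream = JSONL(source)           # or your custom corpus / API loader
    stream = add_latex_html(stream)  # wrap the stream with the preprocessor
    return {'dataset': dataset, 'stream': stream, 'view_id': 'html'}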

Thanks again for your great work :pray: