Template for Prodigy corpus and API loaders

I have built a corpus loader for the StackExchange data dump. Since this is a large, open source of semi-structured text data, it’s a veritable gold mine for spaCy and Prodigy, and I’d like to make it available to the community. Since all of Prodigy’s loaders are compiled, I can’t see how you’ve built your Reddit loader and conform my code to your style. Do you have a template I can follow so that it’s as easy as possible for you to integrate my code into Prodigy?

Additionally, is there a way to make the loader available on the command line? Is the Prodigy app capable of displaying LaTeX and other mathematical markup languages, as in this example? Would it even make sense to display the rendered LaTeX?

Regardless, here is my code for the StackExchange Corpus loader.

import re
import html
from html.parser import HTMLParser
from xml.etree import ElementTree as ET
from pathlib import Path

class StackExchange(object):
    class __MLStripper__(HTMLParser):
        """HTML parser that receives a string with HTML tags and strips out the tags. get_data() returns a string devoid of HTML tags."""
        def __init__(self):
            # initialise the base parser so that feed() works
            super().__init__(convert_charrefs=True)
            self.fed = []
        def handle_data(self, d):
            # collect the text found between tags
            self.fed.append(d)
        def get_data(self):
            return ''.join(self.fed)
    def __init__(self, file, community=None, content_type='post_title', remove_html=True):
        """
        A Prodigy-compliant corpus loader that reads a StackExchange xml file and yields a stream of text in dictionary format.
        :param file: string path name to xml file
        :param community: string, name of stackexchange community
        :param content_type: string, select the type of text to return: post_title, post_body, post_both, comments
        :param remove_html: Boolean, Remove or keep HTML tags in the text
        """
        # Check that the path actually exists and is a recognizable XML file
        se_file = Path(file).absolute()
        assert (se_file.exists()), "Cannot find file. Please check the path name and try again"
        assert (se_file.suffix == '.xml'), "File does not end in '.xml'. Please check the path name and try again"
        self.file = se_file
        # If the user doesn't supply the community, assume a normal unzipping process occurred and the community is the parent directory
        if not community:
            community = self.file.parent.parts[-1]
        self.community = community
        # Acceptable types of StackExchange text content
        self._TYPES = ['post_title', 'post_body', 'post_both', 'comments']
        assert (content_type.lower() in self._TYPES), "Content Type not understood. Acceptable types include {}".format(self._TYPES)
        self.content_type = content_type
        assert(remove_html in [True, False]), "remove_html must be either True or False"
        self.remove_html = remove_html
        # Lazily load the xml file, puts a blocking lock on the file
        self.tree = ET.iterparse(self.file.as_posix(), events=['start', 'end'])
    def _parse_tags(self, tags):
        """
        Parse the Tags attribute of a row in a StackExchange Posts xml file
        :param tags: string, tags formatted between <>
        :returns: List of tags or None
        """
        if tags == '':
            return None
        else:
            t = re.compile('<(.+?)>')
            m = t.findall(tags)
            return m
    def __iter__(self):
        # Iterate through the file and yield the text
        for event, child in self.tree:

            # Start of file, check that the file matches the expected content_type
            if event == 'start' and child.tag != 'row':
                if self.content_type in self._TYPES[:3]:
                    assert(child.tag == 'posts'), "Input file is not a StackExchange Posts.xml file. Please check the path name and try again"
                else:
                    assert(child.tag == 'comments'), "Input file is not a StackExchange Comments.xml file. Please check the path name and try again"

            # A complete row element has been read
            elif event == 'end' and child.tag == 'row':
                # Assemble the Prodigy stream compliant dictionary object
                info = {"meta": {"source": "StackExchange", "Community": self.community, "type": self.file.stem}}

                atb = child.attrib
                if self.content_type in self._TYPES[:3]:
                    title = atb.get('Title', None)
                    body = atb.get('Body', None)
                    tags = atb.get('Tags', None)

                    if self.content_type == 'post_both':
                        if title and body:
                            text =  title + '\n' + body 
                        elif body and not title:
                            text = body
                        elif title and not body:
                            text = title
                        else:
                            text = None

                    elif self.content_type == 'post_title':
                        text = title

                    elif self.content_type == 'post_body':
                        text = body

                elif self.content_type == 'comments':
                    tags = None
                    text = atb.get('Text', None)

                else:
                    tags = None
                    text = None

                # Check to see if valid text was found; if not, move on to the next xml child element
                if text:

                    # unescape HTML encoding and remove html tags
                    text = html.unescape(text)
                    if self.remove_html:
                        #HTML Stripper
                        stripper = self.__MLStripper__()
                        stripper.feed(text)
                        text = stripper.get_data()

                    # Append the text and additional metadata to the stream dictionary
                    info['text'] = text
                    info['meta']['ID'] = atb['Id']

                    if tags is not None:
                        tags = self._parse_tags(tags)
                    info['meta']['Tags'] = tags

                    #yield the dictionary
                    yield info

                # clear the child from memory before moving to the next child element
                child.clear()

I’ve tested the loader using the following code

import prodigy
from prodigy.recipes.textcat import teach
from prodigy.models.textcat import TextClassifier
from prodigy.components.sorters import prefer_uncertain, prefer_high_scores, prefer_low_scores
import plac
import spacy
#import custom loader
from utils.stackExchange import StackExchange

@prodigy.recipe('stackExchange.train', dataset=prodigy.recipe_args['dataset'], 
                query=plac.Annotation("query", 'option', 'q', str),
                file = plac.Annotation("file", 'option', 'f', str),
                content_type = plac.Annotation('con_type', 'option', 'c', str),
                long_text = plac.Annotation('long_text', 'option', 'l', bool))
def stackExchange(dataset, spacy_model, query, label, file, content_type='post_body', long_text=False):
    stream = StackExchange(file, content_type=content_type, remove_html=True)
    nlp = spacy.load(spacy_model)
    label = ','.join(label)
    model = TextClassifier(nlp, label.split(','), long_text=long_text)
    #stream = prefer_high_scores(model(stream))
    #stream = prefer_uncertain(model(stream))
    #stream = prefer_low_scores(model(stream))
    stream = (eg for score, eg in model(stream))
    #components = teach(dataset=dataset, spacy_model=spacy_model, source=stream, label=','.join(label))
    #return components
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'update': model.update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

and the following command line

python -m prodigy stackExchange.teach pat ./Models/PAT --label "Azure SQL" -f "./Data/StackExchange/cs.stackexchange.com/Posts.xml" -c "post_body" -F se_test.py


Wow, this is amazing – thanks so much for sharing your code! :pray: So you’d be okay with having this integrated into the core library? If so, we might have to ask you to fill in a simple contributor agreement.

Your code looks good and very similar to the built-in loaders, actually. The built-in loaders are all classes that are initialised with optional loader-specific settings and have an __iter__ method that yields the tasks. If the corpus or API response exposes the total number of results upfront, I’ve also added a __len__ method. (If a recipe doesn’t resort the stream and doesn’t expose a custom progress function, Prodigy checks whether the stream exposes a __len__ attribute, and uses that to calculate the progress. But of course, this only makes sense if the stream is finite and/or we know its length upfront.)
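
To make that shape concrete, here is a minimal sketch of such a loader class. It is not Prodigy's actual source: ListLoader, its texts and source arguments, and the "meta" fields are made-up names for illustration only.

```python
class ListLoader:
    """Hypothetical loader that wraps an in-memory list of strings."""

    def __init__(self, texts, source="example"):
        # loader-specific settings are passed in when the class is initialised
        self.texts = list(texts)
        self.source = source

    def __iter__(self):
        # yield one task dictionary per text
        for i, text in enumerate(self.texts):
            yield {"text": text, "meta": {"source": self.source, "ID": i}}

    def __len__(self):
        # only worth defining when the total is known upfront
        return len(self.texts)


stream = ListLoader(["How do I parse XML?", "What is a monad?"])
tasks = list(stream)
```

For a corpus that is streamed from disk, the same class would simply leave out __len__.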

Not at the moment – but it’d definitely be nice to make this possible! I’m not sure what the best solution would be – we could try solving this via entry points, similar to what we’re planning for custom recipes. We could also consider allowing the --loader argument to point to a file instead?

prodigy ner.teach dataset model posts.xml --loader loaders.py::StackExchange
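
Just to illustrate the idea, a "file.py::Name" string like the one above could be resolved with the standard library roughly like this. This is only a sketch of a feature that doesn't exist yet: load_from_spec is a hypothetical helper, not part of Prodigy.

```python
import importlib.util
from pathlib import Path


def load_from_spec(spec):
    """Resolve a 'path/to/file.py::Name' string to the object called Name."""
    file_part, _, name = spec.partition("::")
    if not name:
        raise ValueError("Expected 'file.py::Name', got {!r}".format(spec))
    path = Path(file_part)
    # build a module from the file path without requiring it to be on sys.path
    module_spec = importlib.util.spec_from_file_location(path.stem, str(path))
    if module_spec is None or module_spec.loader is None:
        raise ImportError("Cannot import {}".format(path))
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, name)
```

The returned class could then be initialised with the source path and used as the stream.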

In theory, this is possible, but it might be a little hacky. If there’s an easy way to extract the LaTeX markup from the text, you could, for example, use a custom HTML template for this and add an "html" key to the task (instead of the "text"). You could then use a library to convert it to an image, ideally an SVG, because you can simply inline this with the HTML. You can then replace the markup with the image and produce tasks that look like this:

{"html": "Text text <svg>...</svg> text text"}

If you can only convert it to a JPG or PNG, the whole thing is a little trickier, but not impossible. You could save out the files and include them as <img> tags, but this is pretty inconvenient, especially if you do that while processing the stream. So ideally, you’d also want to inline the image with the markup. For Prodigy’s image recipes, I’ve written a few helper functions to convert images to base64-encoded data URIs (those are currently internals):

from prodigy.util import img_to_b64_uri

mimetype = 'image/png'
data_uri = img_to_b64_uri(image_bytes, mimetype)
# this will produce a string like: ...
image_html = '<img src="{}" />'.format(data_uri)

Depending on the image size, this can easily add some undesired bloat to your datasets, though. In general, I’d always recommend keeping a "text" key containing the original string in addition to the "html" property. This way, you’ll always be able to relate the generated HTML back to the original text. (It’ll also make debugging easier if something goes wrong during conversion.)
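
Since the helper above is an internal, here is a rough standard-library sketch of the same idea that also keeps the "text" key next to the "html". The names to_data_uri and make_task are made up for this example, and the image bytes are a placeholder rather than a real PNG.

```python
import base64
import html


def to_data_uri(image_bytes, mimetype="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return "data:{};base64,{}".format(mimetype, encoded)


def make_task(text, image_bytes):
    """Build a task that keeps the original text alongside the generated HTML."""
    image_html = '<img src="{}" />'.format(to_data_uri(image_bytes))
    return {"text": text, "html": "{} {}".format(html.escape(text), image_html)}


# placeholder bytes, assumed to come from whatever renders the formula
task = make_task("Rendered formula:", b"\x89PNG")
```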


I’ll look into adding a __len__ method if possible. StackExchange provides their data as 7zip archives using the Bzip2 compression algorithm for individual files. While most files in the archive are small enough to read entirely into memory, a few are truly massive once decompressed (10 GB to 60 GB). This is why I went with a streaming approach, but it also makes calculating a length more difficult.

Also, I’m still working on a way to work with the 7z archive directly, so that the user doesn’t have to decompress the individual files first. This would also allow for combining the posts with the relevant comments. It’s the 7z archive format I’m wrestling with at the moment. Once I get that figured out, it should be relatively easy to add the ability to stream in the decompressed xml files for parsing.

I’ll go ahead and add an HTML property to the output stream. Eventually I’ll get around to figuring out how to use custom HTML templates with Prodigy. But it’s a simple add that may come in handy for someone else.

Finally, how would you like to work with my code? I’m planning on pushing it to a public Github repository and versioning changes moving forward. Does this work for you?

Thanks again for your work on these excellent tools.

Okay cool – I’ll get a contributor agreement ready. Something simple like the spaCy Contributor Agreement should probably do.

Yes, this makes a lot of sense and we definitely prefer the streaming approach. Don’t bother with the __len__ if it means actually counting the texts – it only makes sense if, say, the headers exposed a total count that was easy to extract upfront. For very large corpora like this, using the total length to calculate the progress isn’t that useful anyways, since you likely won’t be annotating the whole thing at once. It’s much more important that Prodigy is able to present the first tasks for annotation as quickly as possible.

I just had another idea for the LaTeX / mathematical markup conversion: Instead of including this with the loader, you could also make it a preprocessor function that wraps a stream, like the built-in split_sentences or fetch_images (which converts all images, paths and URLs in the stream to data URIs). This would let you and others reuse the function with other loaders and streams from different sources.

def add_latex_html(stream):  # maybe needs a better name?
    for eg in stream:  # iterate over the examples
        # convert the example text to HTML and replace markup, if found
        html = CONVERT_TEXT_TO_HTML_WITH_IMAGES(eg['text'])
        eg['html'] = html  # add a html key to the task
        yield eg

The preprocessor could then be used like this:

stream = JSONL('/path/to/your_data.jsonl')  # or any other loader
stream = add_latex_html(stream)
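
As a slightly more filled-in sketch of that preprocessor: the version below doesn't render anything, it only escapes the text and wraps each formula in a <code> element so a custom HTML template or a later conversion step can pick it up. It assumes formulas are delimited by single dollar signs; mark_latex and the "latex" class name are made up for this example.

```python
import html
import re

# assumption: inline math is written between single dollar signs, e.g. $x^2$
LATEX_RE = re.compile(r"\$(.+?)\$")


def mark_latex(text):
    """Escape the text for HTML and wrap each $...$ span in a <code> element."""
    parts = []
    last = 0
    for match in LATEX_RE.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append('<code class="latex">{}</code>'.format(html.escape(match.group(1))))
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


def add_latex_html(stream):
    for eg in stream:  # iterate over the examples
        eg["html"] = mark_latex(eg["text"])  # the original "text" key is kept as-is
        yield eg


examples = [{"text": "Why is $a < b$ here?"}]
result = list(add_latex_html(examples))
```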

Yes, this sounds good! We’re also working on getting a prodigy-recipes repository up (see here) – this should hopefully make it easier for users to contribute to the built-in recipes, and share their custom recipes and helper functions with the community.

Thanks again for your great work :pray:

@ines Sorry for the late response, I was in an accident and have been recovering.

Here is my code repo.

While I’m laid up in bed I’ll see if I can get a functioning LaTeX conversion function.

@clrogers Aw, sorry to hear – hope you’re feeling better soon! :tulip:

And thanks so much for sharing your code, this looks really nice. Can’t wait to try it out!