Custom HTML template

I have a corpus of HTML documents plus some additional metadata like category and title. I want to classify these using Prodigy and spaCy.

It is easiest to label the documents while keeping the HTML formatting, so right now I am thinking of presenting each report as HTML. Under the hood, though, I want to transform it to plain text and do the classification on that + the metadata.

  1. I am imagining that I need to create my own custom recipe with view_id = html and a corresponding html_template. Correct? Is there a full example of that kind of recipe?
  2. Is it possible to label the data before preprocessing and then preprocess + train under the hood, to get a smart teach recipe that presents me with the 50/50 cases?
  3. What is the best way to use metadata in spaCy? At the moment I am imagining I will just append it to the text, but there might be a smarter way?
  4. I am streaming my HTML documents from Elasticsearch. Can Prodigy still be smart about which ones to classify in teach recipes? Does it select some out of a batch? Right now I’ve created a generator from Doc.search().scan() in elasticsearch_dsl.

Yep! Here's an end-to-end project I really liked that uses a custom HTML interface to embed an audio player:

https://twitter.com/pmbaumgartner/status/1074647491503480832

In theory, there is – but it'd require you to write your own custom model implementation using different features, e.g. in Thinc. So appending it to the text is definitely the easiest solution at this point.

If you treat this as a basic text classification task, you might even be able to do this almost out-of-the-box with textcat.teach. The text classification annotation model in Prodigy will look at a task's "text" – so in your incoming data, you could make the "text" the full raw concatenated text with the meta. But you could add other separate properties to your task and then use those in the HTML template. For example:

{
    "text": "This is a text. Some metadata. Meta meta.",
    "orig_text": "This is a text.",
    "meta1": "Some metadata.",
    "meta2": "Meta meta."
}

<strong>{{orig_text}}</strong>
<p>{{meta1}}</p>

So under the hood, the text classifier will predict on the raw text, but you'll be seeing the nicely-formatted HTML version.
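To make that concrete, here's a rough sketch of how the task dicts could be built. The helper names (html_to_text, make_task) and the "body" / "title" / "category" keys are made up for illustration and aren't part of Prodigy. Only "text" matters to the model. The HTML stripping uses the standard library's html.parser and is deliberately naive, so a real corpus may need something more robust.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Naive: this also picks up the contents of <script> and <style>.
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML string to whitespace-joined text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def make_task(body_html, title, category):
    """Hypothetical helper: "text" is for the model, the rest for the template."""
    plain = html_to_text(body_html)
    return {
        "text": " ".join([title, category, plain]),  # what the classifier sees
        "body": body_html,  # rendered via the HTML template
        "title": title,
        "category": category,
    }


task = make_task("<p>Revenue <b>rose</b> 5%.</p>", "Q3 report", "earnings")
```

The stream passed to the recipe would then just be a generator of make_task(...) results, one per document.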

Yes, that sounds good!

By default, the sorters (the functions like prefer_uncertain that decide whether to send an example out or not) use an exponential moving average for this. Streams are pretty much always generators, so we only ever get to see one batch at a time and can't just sort the entire stream upfront. However, we can keep track of the scores we're seeing and make sure we don't get stuck and never send anything out, or always send everything out. For example, if your examples are consistently scored very low, the sorter will eventually start sending out lower-scoring examples as well that it might otherwise have skipped.
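If it helps to picture it, here's a toy version of that idea. To be clear, this is not Prodigy's actual implementation: the function name, the decay value and the exact threshold rule are all assumptions, and it only illustrates how a moving average lets the bar drift down when the recent scores are all far from 0.5.

```python
def prefer_uncertain_sketch(scored_stream, decay=0.9):
    """Toy sorter: yield examples at least as uncertain as the moving average.

    scored_stream yields (score, example) tuples with score in [0, 1].
    This is an illustration only, not how Prodigy's sorters are written.
    """
    ema = None
    for score, example in scored_stream:
        # 1.0 when the score is exactly 0.5, 0.0 when it is 0 or 1.
        uncertainty = 1.0 - abs(score - 0.5) * 2
        if ema is None:
            ema = uncertainty
        if uncertainty >= ema:
            yield example
        # Move the average towards what we just saw, so a long run of
        # confident scores gradually lowers the bar.
        ema += (1.0 - decay) * (uncertainty - ema)


stream = [(0.5, "a"), (0.35, "b")] + [(0.05, "low")] * 30 + [(0.35, "c")]
selected = list(prefer_uncertain_sketch(stream))
```

Here "b" and "c" have the same score, but "b" arrives right after a perfectly uncertain example and gets skipped, while "c" arrives after a long run of confident scores and is sent out.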


Thanks a lot for your thoughts on this.

It works like a charm. Or it did. Now I am getting "No tasks available" after labelling 471 documents. I have ~55k in total. My recipe is:

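import prodigy
from prodigy.recipes import textcat  # built-in recipe module providing textcat.teach

# Report and preprocess_for_tagging are defined elsewhere in the project.


@prodigy.recipe('exchange_statement_cat_teach',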
                spacy_model=prodigy.recipe_args['spacy_model'],
                label=prodigy.recipe_args['label_set'])
def exchange_statement_cat_teach(spacy_model='en', label='earnings'):
    """Custom wrapper for textcat.teach recipe that replaces the stream."""
    dataset = 'exchange-statement-tag'

    stream = ({
        'text': preprocess_for_tagging(report),
        'body': report.body,
        'tags': ', '.join(report.source.tags),
        'title': report.title,
        'meta': {
            'id': report.meta.id
        }
    } for report in Report.search().scan())

    components = textcat.teach(dataset=dataset, spacy_model=spacy_model,
                               source=stream, label=label)
    components.update(
        view_id='html',
        config=dict(
            html_template='<p><strong>Source tags</strong>: {{ tags }}</p><p><strong>Title</strong>: {{ title }}</p><br><br><p>{{{ body }}}</p>',
        )
    )

    return components

Any idea what's wrong here?

Looks good! Two things come to mind here:

  1. You are running textcat.teach, which will select the most relevant examples – by default, the ones with a score closest to 0.5. So it makes sense that you’re not going to see every example, and that many examples will be skipped because the scores are very low or very high. 471 out of 55k does sound pretty low, though.

  2. I don’t know how elasticsearch_dsl and the methods you’re calling work under the hood, but it might be worth checking and debugging whether it’s really getting all of your records and whether it’s fetching them consistently. If it ever raises a StopIteration (i.e. the generator’s way of saying “I’m done”), it will end the stream, and Prodigy will tell you that no tasks are available anymore – because well, there aren’t. The stream generator you wrote looks good, but almost a little too good and straightforward to be true ;)
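One quick way to check the second point is to wrap the stream in a small pass-through generator that counts what comes out of the source and reports when it runs dry. This is just a debugging sketch: count_stream and its arguments are hypothetical names, not something Prodigy provides.

```python
def count_stream(stream, log=print, every=1000):
    """Pass tasks through unchanged while logging how many were pulled."""
    total = 0
    for total, task in enumerate(stream, start=1):
        if total % every == 0:
            log(f"{total} tasks pulled from the source")
        yield task
    # Only reached once the underlying iterator is exhausted.
    log(f"Stream exhausted after {total} tasks")


messages = []
tasks = list(count_stream(iter(range(5)), log=messages.append, every=2))
```

In the recipe, you'd wrap the generator expression in count_stream(...) before passing it as the source. If "Stream exhausted" shows up after far fewer records than you expect, the problem is on the Elasticsearch side. If the count keeps climbing while you see few tasks, the sorter is skipping most examples.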

Alright thanks again. I think it might be related to textcat.teach then. In fact my data is very skewed with a few positives only - I’m guessing 1/50. And I just started the model from scratch without the model knowing anything about a positive or negative at all. That might be the issue.

The stream generator works. At least I get i = 55.726 if I run

for i, r in enumerate(Report.search().scan()):
    pass