I have a corpus of HTML documents plus some additional metadata, like the title. I want to classify these using Prodigy and spaCy.
It is easiest to label the documents while keeping the HTML formatting. So right now I am thinking of presenting the report as HTML, but under the hood I want to transform it to plain text and do the classification on that plus the metadata.
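To make the idea concrete, here is a minimal sketch of the conversion I have in mind, using only Python's standard library. The names `html_to_text` and `make_task` are my own placeholders, and the task layout (`html` for display, `text` for the model, `meta` for the title) is my assumption about what would work, not something I have confirmed. Putting the title in front of the body is just the naive way of combining text and metadata.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Tags that should act as word boundaries in the plain text.
    BLOCK = {"p", "div", "br", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def make_task(html, title):
    """Build one task dict: HTML for the annotator, plain text for the model.

    The key names are assumptions for illustration.
    """
    body = html_to_text(html)
    return {
        "html": html,                  # what I would like to see while labelling
        "text": f"{title}\n\n{body}",  # naive "append the metadata" approach
        "meta": {"title": title},
    }
```

With something like this, every document would carry both representations, so the annotation view and the model input stay tied to the same example.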
- I am imagining that I need to create my own custom recipe with view_id = html and a corresponding html_template. Correct? Does a full example of this kind of recipe exist?
- Is it possible to label the data before preprocessing and then preprocess + train under the hood to achieve a smart teach recipe that presents me with the 50/50 cases?
- What is the best way to use metadata in spaCy? At the moment I am imagining I will just append it to the text, but there might be a smarter way?
- I am streaming my HTML documents from Elasticsearch. Can Prodigy still be smart about which ones to classify in teach recipes? Does it select some out of a batch? Right now I’ve created a generator of