Dear support, I would like to know if it is feasible with Prodigy to classify the content of a webpage. I would like to look at the title of the page, the alt text of the images, and the text of the headings. As an example, if the title contains ‘cook’ and the alt text is ‘pasta’, the page would be classified as ‘cooking recipe’.
It would be really useful for me to understand the process for getting this type of classification.
Any suggestions are really appreciated!
spaCy and Prodigy currently assume the input data is either an image or unstructured text. There’s no direct support for “somewhat structured” input such as HTML, PDF or .doc files. I’m not aware of any other tools that offer built-in support for these documents either.
The problem with supporting these documents is that you usually need to look at a number of instances of the input to derive the formatting conventions. Just looking at the examples one-by-one doesn’t let you use the metadata effectively. Currently, humans are much better at this task than statistical models. It’s likely possible to come up with a model that addresses this, and there are likely a number of research papers on related topics. However, I’m not aware of any implementations that are currently worth using. If you find something, I’d be interested to know about it.
My recommendation would be to first label a random sample by hand, so that you can evaluate whatever process you develop automatically. Prodigy’s html view will likely be helpful for this.
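As a rough sketch of that first step, here is one way to draw a reproducible random sample and turn it into JSONL tasks. It assumes each task carries the markup under an "html" key, which is what the html view renders; the `pages` dictionary, the URLs and the helper names are made up for illustration.

```python
import json
import random


def sample_html_tasks(pages, k, seed=0):
    """Pick k pages at random and wrap each one as a task dict with an "html" key."""
    rng = random.Random(seed)  # fixed seed, so the evaluation sample is reproducible
    chosen = rng.sample(sorted(pages), min(k, len(pages)))
    return [{"html": pages[url], "meta": {"url": url}} for url in chosen]


def to_jsonl(tasks):
    """Serialize the tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


# Hypothetical pages keyed by URL; in practice these would be your fetched documents.
pages = {
    "https://example.com/a": "<title>How to cook dinner</title><h1>Weeknight ideas</h1>",
    "https://example.com/b": "<title>Train timetable</title><h2>Departures</h2>",
    "https://example.com/c": "<title>Garden tips</title><h2>Spring planting</h2>",
}

tasks = sample_html_tasks(pages, k=2)
print(to_jsonl(tasks))
```

The resulting lines can be saved to a file and annotated by hand. Keeping the URL in "meta" makes it easy to match your labels back to the original pages later.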
Once you have an evaluation, you can iterate on a solution. You might find an entirely rule-based solution works well; it depends on your data. Alternatively, you might develop rules to extract the text, and then classify the text. There are a number of ways you could do that. In some cases, applying a single classifier over the whole extracted text will be best. In other situations, labelling the text sections and applying the classifier section-by-section will be better. If the sections really change the meaning of the text, you might even want separate classifiers, with different weights per section. It’s difficult to guess what will work best, which is why a consistent evaluation is key to success.
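To make the rule-based option concrete, here is a minimal sketch using only Python's standard library. It pulls out the title, headings and image alt text, applies the ‘cook’ and ‘pasta’ rule from the question, and scores the result against hand labels. The class and function names, the "other" fallback label and the two example pages are all assumptions for illustration, not anything built into spaCy or Prodigy.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect title text, heading text and image alt text from an HTML page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "headings": [], "alt": []}
        self._current = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in self.HEADINGS:
            self._current = "headings"
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.fields["alt"].append(alt.strip())

    def handle_endtag(self, tag):
        if tag == "title" or tag in self.HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current].append(data.strip())


def extract_fields(html):
    """Return lowercased text for each field, joined into one string per field."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts).lower() for name, parts in parser.fields.items()}


def classify(html):
    """Toy rule from the question: 'cook' in the title and 'pasta' in an alt text."""
    fields = extract_fields(html)
    if "cook" in fields["title"] and "pasta" in fields["alt"]:
        return "cooking recipe"
    return "other"  # hypothetical fallback label


def accuracy(gold, predict):
    """Fraction of hand-labelled (html, label) pairs that predict() gets right."""
    correct = sum(predict(html) == label for html, label in gold)
    return correct / len(gold)


# Hypothetical hand-labelled pages standing in for your evaluation sample.
page_recipe = (
    "<html><head><title>How to cook dinner</title></head><body>"
    "<h1>Weeknight ideas</h1><img src='p.jpg' alt='Pasta with tomato sauce'>"
    "</body></html>"
)
page_trains = (
    "<html><head><title>Train timetable</title></head><body>"
    "<h2>Departures</h2><img src='t.jpg' alt='Train'></body></html>"
)
gold = [(page_recipe, "cooking recipe"), (page_trains, "other")]

print(classify(page_recipe))
print(accuracy(gold, classify))
```

Because extract_fields keeps the sections separate, the same output could feed a statistical classifier instead of the hand-written rule, either over the joined text or section-by-section, and the accuracy function lets you compare those options on the same labelled sample.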