So I'm currently trying to label entities in itinerary data extracted from webpage sources. I've started to realize the first step is probably classifying each piece of data as either an itinerary or noise.
Example: "Roundtrip Flight", "Austin (AUS) to Denver (DEN)", "Tue, May 7 - Mon, May 13", "1 ticket: 1 adult", "Traveler 1: Adult", "$77.57"
I've started gathering data and have a method that captures a lot of it, but I want to make sure I store it in a structure that's usable for Prodigy and spaCy. That leads to my problem: sentence boundary detection. No matter how I slice a single HTML page, I end up with fragments of the itinerary plus a lot of noise, so when I load it into spaCy I get annotation questions asking me to classify partial itineraries. Does anyone have good strategies for preprocessing HTML into a form that's usable in Prodigy? Do I even need to preprocess it, or is this an opportunity for long-text classification?
HTML-to-text extraction is often difficult, especially if you're pulling the HTML from a wide variety of websites. If you're pulling it from a single source, you might want to extract into an intermediate format, such as XML or JSON, that you know preserves the logical structure you're interested in. That way you can generate text views from it that suit different purposes, without losing important aspects of the page.
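To make that concrete, here's a minimal sketch of extracting an HTML page into a JSON-friendly intermediate structure using only the standard library. The set of block-level tags and the output shape are my own assumptions, not anything from spaCy or Prodigy:

```python
import json
from html.parser import HTMLParser

# Tags treated as logical blocks; adjust for your source site (assumption).
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "td", "div"}

class BlockExtractor(HTMLParser):
    """Flatten a page into a list of {"tag": ..., "text": ...} blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # extracted blocks, in document order
        self._stack = []   # currently open block tags
        self._buffer = []  # text fragments for the current block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            self._flush()
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._buffer.append(data.strip())

    def close(self):
        super().close()
        self._flush()

    def _flush(self):
        if self._buffer:
            tag = self._stack[-1] if self._stack else "body"
            self.blocks.append({"tag": tag, "text": " ".join(self._buffer)})
            self._buffer = []

html = "<h2>Trip</h2><p>Austin (AUS) to Denver (DEN)</p><p>$77.57</p>"
parser = BlockExtractor()
parser.feed(html)
parser.close()
print(json.dumps(parser.blocks, indent=2))
```

From a structure like this you can generate a plain-text view for spaCy while keeping the tag information around for later.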
Markup languages like HTML often carry important non-text information in the formatting. If you extract to plain text, you can lose information about what's a list, what's a section, etc. Obviously spaCy expects to work on raw text, but that doesn't mean the non-text aspects aren't going to be important in other ways.
I've made a parser that segments HTML into a list of paragraphs (if you see a new line on the rendered HTML page, you get a new paragraph). Is that something you could use? Note Matthew Honnibal's caveat above, though.
I have this exact issue. I actually want some of the non-text aspects as features, but I don't know how to achieve that within a spaCy pipeline. I've laid out my challenge in another thread.
What if you transformed the HTML into some simple Markdown, e.g. replacing strong and em with * and headers with #, while the rest is simply converted to raw text with line breaks? Then you could train some word vectors so the Markdown would affect the sense2vec vectors. Does that sound like a good or a bad idea, @honnibal?
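Something like this is what I mean. It's a minimal sketch that assumes the input HTML is simple enough for regex substitution; real pages would need a proper parser:

```python
import re

def html_to_markdown(html: str) -> str:
    # Emphasis tags become * markers.
    html = re.sub(r"</?(?:strong|b|em|i)>", "*", html)
    # Headers become # lines.
    html = re.sub(r"<h[1-6][^>]*>", "# ", html)
    html = re.sub(r"</h[1-6]>", "\n", html)
    # Paragraph ends and <br> become line breaks.
    html = re.sub(r"</p>|<br\s*/?>", "\n", html)
    # Drop any remaining tags and collapse blank lines.
    html = re.sub(r"<[^>]+>", "", html)
    return re.sub(r"\n{2,}", "\n", html).strip()

print(html_to_markdown("<h3>Your trip</h3><p><strong>Roundtrip</strong> Flight</p>"))
```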
Converting to Markdown could be helpful --- it might make it easier to use the structural features. I doubt that the embedding approach you suggested would work though, if I'm understanding it correctly. You want the markup to have some influence on the vectors, but not too much influence. You could maybe try embedding the markup labels, and then using their vectors as an additive feature for the paragraph. So the vector for word `i` would be `embeddings[markup[i]] + embeddings[words[i]]`.
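A numpy sketch of that additive scheme, with made-up vocabularies and dimensions just to show the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies (assumptions for illustration only).
word_vocab = {"roundtrip": 0, "flight": 1, "$77.57": 2}
markup_vocab = {"plain": 0, "strong": 1, "h1": 2}

dim = 8
word_emb = rng.normal(size=(len(word_vocab), dim))
markup_emb = rng.normal(size=(len(markup_vocab), dim))

def encode(words, markup):
    """Vector for token i = word embedding + markup-label embedding."""
    return np.stack([
        word_emb[word_vocab[w]] + markup_emb[markup_vocab[m]]
        for w, m in zip(words, markup)
    ])

vecs = encode(["roundtrip", "flight"], ["strong", "plain"])
print(vecs.shape)  # (2, 8)
```

In a trained model both tables would be learned jointly, so the markup contributes a small, consistent offset rather than dominating the word identity.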