I'm currently trying to label entities in itinerary data extracted from webpage sources. I've come to realize the first step is probably classifying each piece of text as either itinerary content or noise.
Example: "Roundtrip Flight", "Austin (AUS) to Denver (DEN)", "Tue, May 7 - Mon, May 13", "1 ticket: 1 adult", "Traveler 1: Adult", "$77.57"
I've started gathering data and have a method that captures a lot of it, but I want to store it in a structure that's usable by Prodigy and spaCy. That leads me to my actual problem: sentence boundary detection. No matter how I slice a single HTML page, I end up with fragments of the itinerary plus a lot of noise, so when I load it into spaCy I get annotation questions asking me to classify partial itineraries. Does anyone have good strategies for preprocessing HTML into a usable form for Prodigy? Do I even need to preprocess it, or is this an opportunity for long-text classification?
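For what it's worth, one approach I've been sketching is to skip sentence boundary detection entirely and instead treat each block-level HTML element as its own candidate "sentence", emitting Prodigy-style JSONL records (`{"text": ...}`). This is just a minimal sketch using the stdlib `html.parser`; the tag lists, the `min_chars` threshold, and the dedup behavior are assumptions you'd tune for your actual pages:

```python
# Sketch: split a page at block-level tag boundaries instead of running SBD,
# and emit one {"text": ...} record per block (Prodigy's JSONL text format).
# BLOCK_TAGS / SKIP_TAGS / min_chars are assumptions to tune per source.
import json
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "li", "td", "div", "h1", "h2", "h3"}
SKIP_TAGS = {"script", "style"}  # pure noise, never annotate

class BlockTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.records = []
        self._seen = set()   # dedupe repeated strings (nav menus, footers)
        self._buf = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()  # a new block starts: close out the previous one

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def flush(self, min_chars=3):
        # Collapse whitespace; drop tiny fragments and exact duplicates.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if len(text) >= min_chars and text not in self._seen:
            self._seen.add(text)
            self.records.append({"text": text})

def html_to_records(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.flush()  # capture any trailing text
    return parser.records

if __name__ == "__main__":
    page = ('<div><p>Roundtrip Flight</p><p>$77.57</p>'
            '<script>var x = 1;</script><p>Roundtrip Flight</p></div>')
    for rec in html_to_records(page):
        print(json.dumps(rec))
```

Each line of output can go straight into a `.jsonl` file for `prodigy textcat.manual`, so the classify-itinerary-vs-noise step happens per block rather than per fragmented "sentence". Whether that's better than long-text classification over the whole page probably depends on how consistently the itinerary bits land in their own elements.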