Can recognised named entities be used as features for price prediction using ML? (Named entities treated as categorical data and converted into numeric indicator features using one-hot encoding for prediction)


We run a third-party marketplace for used cars. We would like to use ML to estimate prices from both the structured attributes and the unstructured descriptions provided by the car owners, so that we can identify cars that are under-priced. If a car is under-priced, we can purchase and resell it.

We wonder if we can use Prodigy to annotate the unstructured descriptions for entity recognition and use the extracted entities as features for price prediction.

Thank you.

Hi Stellen,

I think what you're interested in should be possible, but it will likely be a combination of a relatively simple regression model for the financials, with some features predicted from the unstructured text. So there's an interplay between two models there: someone will need to design the trade-off between what you can get out of the text accurately and easily, and what information is actually useful in the pricing model. I think the project will require a lot of domain expertise combined with some amount of prior NLP experience.
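To make the "simple regression plus text-derived features" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy listings are invented, `has_sunroof` stands in for a feature predicted from the description, and `fit_linear` and `predict` are hypothetical helper names. In practice you would more likely use a library such as scikit-learn for the regression.

```python
from typing import List


def fit_linear(X: List[List[float]], y: List[float]) -> List[float]:
    """Ordinary least squares via the normal equations.

    Returns [intercept, w1, w2, ...]. Toy implementation for small inputs.
    """
    rows = [[1.0] + list(x) for x in X]  # prepend a constant column for the intercept
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(weights: List[float], x: List[float]) -> float:
    """Apply the fitted intercept and coefficients to one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


# Invented toy data: [mileage in thousands of km, has_sunroof (0/1, from the text)]
X = [[40, 1], [80, 0], [120, 0], [60, 1], [100, 1]]
# Invented prices, generated as 20000 - 50 * mileage + 1500 * has_sunroof
y = [19500, 16000, 14000, 18500, 16500]

weights = fit_linear(X, y)
estimate = predict(weights, [90, 0])
```

The point of the sketch is only the shape of the pipeline: the text model contributes one more column to the regression input, and whether that column actually improves the price estimate is something you have to measure on your own data.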

For example, you might find that sellers with certain demographic features often have underpriced listings ("a little old lady who just drove to church on Sundays"), but simultaneously, savvy sellers try to pose as these groups. Maybe something in the text gives this away, and you can figure out how to remove the misleading demographic information for those listings. Maybe. But maybe there are no cases like this where the text features help, or you're unable to predict the feature accurately even if you can think of it.

I really can't say whether the project will be successful, but I think Prodigy would be a good tool for the NLP component, as it's well suited for rapid development.

Hi Honnibal.

Many thanks for your advice. It reinforced our ideas on how to go about implementing this project and gave us an idea or two.

My apologies for not making this clear in the question. Rather than "reading between the lines" using NLP, we have a dataset of listings labelled with ground-truth prices. We will model the relationships between the prices and the corresponding unstructured and structured data using ML.

We hope our machine will eventually be intelligent enough to price a listed car with a certain level of accuracy. The difference between the price estimated by the machine and the price listed by the user suggests under- or over-pricing.
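As a rough sketch of that comparison, assuming the predicted prices already come from some trained model: the listing values, the 15% threshold, and the helper names `pricing_gap` and `flag_underpriced` below are all made up for illustration.

```python
from typing import Dict, List


def pricing_gap(predicted: float, listed: float) -> float:
    """Relative gap between listed and predicted price (negative means cheaper than predicted)."""
    if predicted <= 0:
        raise ValueError("predicted price must be positive")
    return (listed - predicted) / predicted


def flag_underpriced(rows: List[Dict], threshold: float = 0.15) -> List[str]:
    """Return ids of listings priced at least `threshold` below the model's estimate."""
    return [
        row["id"]
        for row in rows
        if pricing_gap(row["predicted"], row["listed"]) <= -threshold
    ]


# Invented example listings; "predicted" would come from the pricing model.
candidates = [
    {"id": "A", "predicted": 10000.0, "listed": 8000.0},
    {"id": "B", "predicted": 10000.0, "listed": 9500.0},
    {"id": "C", "predicted": 5000.0, "listed": 6000.0},
]

underpriced = flag_underpriced(candidates)
```

The threshold would need to reflect the model's typical error, otherwise most flagged listings will just be cases where the estimate is off.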

After further research, I believe we can use Prodigy and NER to extract certain features from the unstructured descriptions to serve as structured predictor variables to be input into the training model for price prediction.
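A minimal sketch of what we have in mind, assuming the entities have already been extracted by an NER model: the labels `CONDITION` and `FEATURE`, the example listings, and the helper names are hypothetical. Each distinct entity becomes one 0/1 indicator column (a multi-hot variant of one-hot encoding, since a description can mention several entities), appended to the structured attributes.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (entity text, entity label)

# Invented listings: structured attributes plus hypothetical NER output.
listings: List[dict] = [
    {"mileage_km": 42000, "year": 2015,
     "entities": [("One owner", "CONDITION"), ("sunroof", "FEATURE")]},
    {"mileage_km": 120000, "year": 2009,
     "entities": [("accident", "CONDITION")]},
    {"mileage_km": 80000, "year": 2012,
     "entities": [("Sunroof", "FEATURE"), ("leather seats", "FEATURE")]},
]


def entity_key(entity: Entity) -> str:
    """Normalise an entity so that 'Sunroof' and 'sunroof' share one column."""
    text, label = entity
    return f"{label}={text.strip().lower()}"


def build_vocabulary(rows: List[dict]) -> Dict[str, int]:
    """Map each distinct entity key seen in the data to a column index."""
    keys = sorted({entity_key(e) for row in rows for e in row["entities"]})
    return {key: index for index, key in enumerate(keys)}


def encode(row: dict, vocab: Dict[str, int]) -> List[float]:
    """Structured attributes followed by one 0/1 indicator per known entity."""
    indicators = [0.0] * len(vocab)
    for entity in row["entities"]:
        index = vocab.get(entity_key(entity))
        if index is not None:  # entities not in the vocabulary are ignored
            indicators[index] = 1.0
    return [float(row["mileage_km"]), float(row["year"])] + indicators


vocab = build_vocabulary(listings)
feature_rows = [encode(row, vocab) for row in listings]
```

The vocabulary should be built from the training listings only and reused at prediction time, so that the columns stay in the same order.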

Noted about Prodigy being a tool for rapid development. We do need such a tool to iterate through different methods.

Thank you.