Hey all,
I'm currently writing my master's thesis at RWTH Aachen and have run into the question of which approach to choose.
Roughly speaking, the thesis is about making use of unstructured data for SMEs via NLP and machine learning.
Right now I need to analyze texts consisting of product descriptions from the manufacturing process and cluster them by similarity.
Here's an example:
But unfortunately, not all of the texts are structured like this. You also have texts like:
Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ
or
kundeneigener Kristallspiegel silber 5 mm Modell - Fünfeck mit 2 x re. Winkeln Format: 1430 x 899 mm alle Modellkanten poliert durch uns von unten Ausflinsung nachpoliert auf: 1 Stück 1430 x 897 mm ohne Berechnung aus Kulanz!
And of course there are typos, etc., because the text is human-generated.
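To make "cluster by similarity" concrete, here is a rough sketch of the direction I have in mind with spaCy. This is only an illustration, not my actual pipeline: I'm assuming the de_core_news_md German vectors, a recent scikit-learn, and the distance threshold is just a placeholder.

```python
import spacy
from sklearn.cluster import AgglomerativeClustering

# German pipeline with word vectors (assumption: de_core_news_md is installed)
nlp = spacy.load("de_core_news_md")

texts = [
    "Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ",
    "kundeneigener Kristallspiegel silber 5 mm Modell - Fünfeck mit 2 x re. Winkeln "
    "Format: 1430 x 899 mm alle Modellkanten poliert",
]

# One averaged word vector per product description
vectors = [nlp(text).vector for text in texts]

# Group descriptions whose vectors are close in cosine distance;
# the 0.3 threshold is a placeholder I would still have to tune
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.3,
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(vectors)
print(labels)
```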
My question is which approach is best suited to this problem: training a new model from scratch with Prodigy, working with regular expressions, continuing with spaCy (where I started), or something else entirely?
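For comparison, the regex route would look roughly like this for the recurring fields. The patterns below are only guesses based on the two examples above, so they would certainly need more work for the messier texts:

```python
import re

text = "Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ"

# Rough patterns guessed from the examples above
thickness = re.search(r"(\d+(?:,\d+)?)\s*mm", text)      # e.g. "8 mm"
dimensions = re.search(r"(\d+)\s*x\s*(\d+)\s*mm", text)   # e.g. "740 x740 mm"

print(thickness.group(1) if thickness else None)    # -> "8"
print(dimensions.groups() if dimensions else None)  # -> ("740", "740")
```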
I really hope you guys can give me some advice on this.
Thanks in advance!