Best approach for master's thesis

Hey all,
I'm currently writing my master's thesis at RWTH Aachen and ran into the question of which approach to choose.
Roughly speaking, my thesis is about making use of unstructured data for SMEs via NLP and machine learning.
Right now I need to analyze texts consisting of product descriptions for the manufacturing process and cluster them by similarity.
Unfortunately, not all of the text is well structured. You also get texts like:
Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ
kundeneigener Kristallspiegel silber 5 mm Modell - Fünfeck mit 2 x re. Winkeln Format: 1430 x 899 mm alle Modellkanten poliert durch uns von unten Ausflinsung nachpoliert auf: 1 Stück 1430 x 897 mm ohne Berechnung aus Kulanz!
And of course there are typos, etc., because it's human-generated text.

My question is which approach is best suited to this problem: training a new model in Prodigy from scratch, working with regular expressions, continuing with spaCy, where I started, or something else entirely?

I really hope you guys can give me some advice on this.
Thanks in advance!

Hi Lucas,

This is the type of question you'll have to decide for yourself as part of your research, and explain why you took the approach you did. One consideration is whether the regular expressions can be of help to other people, in which case they can be included as a contribution. If you're just looking to get the annotations completed quickly, then it really comes down to how you find the different workflows, and how your data is behaving.
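If it helps as a starting point: one cheap baseline for "cluster by similarity" that copes reasonably well with typos is character n-gram TF-IDF plus k-means. This is only a sketch with scikit-learn, not Prodigy/spaCy specific, and the four texts are made-up stand-ins loosely based on the examples in the question:

```python
# Baseline: cluster short product descriptions by character n-gram similarity.
# Character n-grams are more tolerant of typos than word-level features.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in data loosely based on the examples in the question.
texts = [
    "Floatglas 8 mm Format 740 x740 mm Kanten poliert",
    "Floatglas 6 mm Format 500 x 500 mm Kanten polirt",  # typo on purpose
    "Kristallspiegel silber 5 mm Format: 1430 x 899 mm",
    "Kristallspiegel silber 4 mm Format: 900 x 600 mm",
]

# "char_wb" builds character 3-5 grams within word boundaries,
# which smooths over small spelling variations.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(texts)

# TF-IDF rows are L2-normalised by default, so Euclidean k-means
# behaves much like cosine-based clustering here.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

The two Floatglas texts should land in one cluster and the two mirror texts in the other, despite the deliberate typo. Whether this beats spaCy's vectors or a trained model on the real data is an empirical question, which is rather the point above.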

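On the regular-expression option: even in the messy texts, the dimensions seem to follow a "width x height mm" pattern. A minimal sketch of pulling those out (the pattern is an assumption based on the two examples in the question, not a complete solution):

```python
import re

# Matches "740 x740 mm" or "1430 x 899 mm"; tolerant of inconsistent spacing.
# Assumed pattern, generalised from only the two examples above.
DIM = re.compile(r"(\d+)\s*x\s*(\d+)\s*mm", re.IGNORECASE)

text = "Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ"
print(DIM.findall(text))  # -> [('740', '740')]
```

If such patterns turn out to generalise across the data, that's exactly the kind of thing that could be written up as a contribution; if not, that's a finding too.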