Hey all,
I'm currently writing my master's thesis at RWTH Aachen and have run into the question of which approach to choose.
Roughly speaking, the thesis is about making use of unstructured data for SMEs via NLP and machine learning.
Right now I need to analyze texts consisting of product descriptions from the manufacturing process and cluster them by similarity.
Here's an example:
But unfortunately, not all of the texts are structured like this. You also have texts like:
Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ
or
kundeneigener Kristallspiegel silber 5 mm Modell - Fünfeck mit 2 x re. Winkeln Format: 1430 x 899 mm alle Modellkanten poliert durch uns von unten Ausflinsung nachpoliert auf: 1 Stück 1430 x 897 mm ohne Berechnung aus Kulanz!
And of course there are typos, etc., because the text is human-generated.
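To make "cluster by similarity" concrete, here is a rough sketch of the direction I have in mind with spaCy. This is only an illustration, not my actual pipeline: I'm assuming the de_core_news_md German vectors, a recent scikit-learn, and the distance threshold is just a placeholder.

```python
import spacy
from sklearn.cluster import AgglomerativeClustering

# German pipeline with word vectors (assumption: de_core_news_md is installed)
nlp = spacy.load("de_core_news_md")

texts = [
    "Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ",
    "kundeneigener Kristallspiegel silber 5 mm Modell - Fünfeck mit 2 x re. Winkeln "
    "Format: 1430 x 899 mm alle Modellkanten poliert",
]

# One averaged word vector per product description
vectors = [nlp(text).vector for text in texts]

# Group descriptions whose vectors are close in cosine distance;
# the 0.3 threshold is a placeholder I would still have to tune
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.3,
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(vectors)
print(labels)
```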
My question is which approach is best suited to this problem: training a new model from scratch with Prodigy, working with regular expressions, continuing with spaCy (where I started), or something else entirely?
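For comparison, the regex route would look roughly like this for the recurring fields. The patterns below are only guesses based on the two examples above, so they would certainly need more work for the messier texts:

```python
import re

text = "Floatglas 8 mm Format 740 x740 mm Kanten poliert Ecken gestoßen inkl. EMZ"

# Rough patterns guessed from the examples above
thickness = re.search(r"(\d+(?:,\d+)?)\s*mm", text)      # e.g. "8 mm"
dimensions = re.search(r"(\d+)\s*x\s*(\d+)\s*mm", text)   # e.g. "740 x740 mm"

print(thickness.group(1) if thickness else None)    # -> "8"
print(dimensions.groups() if dimensions else None)  # -> ("740", "740")
```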
I really hope you guys can give me some advice on this.
Thanks in advance!