Dear prodigy-yodas,
I am a linguist investigating the representation of sentiments in a specific group of literary works in German, in relation/proximity to certain entities I hope to extract.
I therefore need to train a model to recognise, first of all, two or more types of entities (potentially sub-categories of GPE and LOC) and then, as a following step, several types of sentiment as well.
- So a first question would be: is it possible to train a model to recognise both entities and sentiments with prodi.gy?
In terms of entities, I need to be able to find GPE on the one hand and, on the other, what I would call "NAT_TERMS", i.e. general terms that describe natural locations and objects such as "mountain", "river", "path", "valley", "rock" and so on. I thought the latter would fall under the existing category "LOC", but that does not seem to be the case.
- What exactly, then, is the difference between GPE and LOC? Especially as cities and countries seem to be detected by spaCy as LOC rather than GPE, as I would have expected? (I'm talking about the large German model.)
Given these premises, and given that I have:
• a folder with many .txt files (my corpus)
• lists of NAT_TERMS and GPE words that I want the model to recognise (these can be CSV or JSON)
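In case it helps to make the second point concrete: one common way to feed such word lists to Prodigy is as a JSONL match-patterns file, where each line pairs a label with a token pattern. A minimal sketch of converting a one-term-per-line CSV into that shape, assuming the lists really are one term per line; the function name and file names are my own, not from any recipe:

```python
import csv
import json


def terms_to_patterns(csv_path, label, out_path):
    """Convert a one-term-per-line CSV into a JSONL patterns file.

    Each line becomes {"label": ..., "pattern": [{"lower": tok}, ...]},
    splitting multi-word terms into one token dict per word.
    """
    with open(csv_path, newline="", encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for row in csv.reader(f):
            if not row or not row[0].strip():
                continue  # skip blank lines
            term = row[0].strip()
            pattern = [{"lower": tok.lower()} for tok in term.split()]
            out.write(json.dumps({"label": label, "pattern": pattern},
                                 ensure_ascii=False) + "\n")


# hypothetical file names:
# terms_to_patterns("nat_terms.csv", "NAT_TERMS", "patterns_nat.jsonl")
```

The resulting file could then be passed to an annotation recipe via its patterns option, so the terms are pre-highlighted while annotating.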
- Is there an easy way to use the .txt files in a folder directly as a dataset, or do I have to convert them to a JSONL file with the desired dataset structure (potentially including metadata)?
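If conversion does turn out to be necessary, here is a minimal sketch of what I mean by converting the folder: one JSON-lines record per file, with the text plus the file name as metadata. The `"text"`/`"meta"` record shape is the usual JSONL layout; the paths are placeholders:

```python
import json
from pathlib import Path


def txt_folder_to_jsonl(folder, out_path):
    """Write one JSONL record per .txt file in `folder`:
    the full text plus the file name as metadata."""
    with open(out_path, "w", encoding="utf-8") as out:
        for txt in sorted(Path(folder).glob("*.txt")):
            record = {
                "text": txt.read_text(encoding="utf-8"),
                "meta": {"source": txt.name},
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")


# hypothetical paths:
# txt_folder_to_jsonl("corpus/", "corpus.jsonl")
```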
What if I wanted to run spaCy's nlp over all my texts, split them into paragraphs, and then use these paragraphs as my dataset?
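For the paragraph idea, the splitting itself may not even need the spaCy pipeline: if paragraphs are separated by blank lines, the standard library alone could produce paragraph-per-record JSONL, keeping the source file and paragraph index as metadata. A sketch under that blank-line assumption (names are illustrative):

```python
import json
import re
from pathlib import Path


def paragraphs_to_jsonl(folder, out_path):
    """One JSONL record per paragraph: split each .txt file on blank
    lines and record the source file and paragraph index as metadata."""
    with open(out_path, "w", encoding="utf-8") as out:
        for txt in sorted(Path(folder).glob("*.txt")):
            raw = txt.read_text(encoding="utf-8")
            # split on one or more blank (possibly whitespace-only) lines
            paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw)]
            for i, para in enumerate(p for p in paragraphs if p):
                record = {
                    "text": para,
                    "meta": {"source": txt.name, "paragraph": i},
                }
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The metadata would make it possible to trace any annotation back to its file and position in the corpus later on.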
Lastly, just to make sure I have understood: is it correct that the whole process in my case could be simplified as follows?
- use a small portion of the corpus to teach the model new entities
- use a small portion of the corpus to teach the model sentiments
- train the model(s)
- apply them to the large corpus and explore the representation of sentiments in proximity to entities
Thanks!
Best
Grig.