project NER help

Dear prodigy-yodas,

I am a linguist investigating the representation of sentiments in a specific group of literary works in German, in relation/proximity to certain entities I hope to extract.

I therefore need to train a model to recognise, first of all, two or more types of entities (potentially sub-categories of GPE and LOC), and, as a following step, to also recognise several types of sentiments.

  1. So a first question would be, is it possible to train a model to recognise both entities and sentiments with prodi.gy?

In terms of entities, I need to be able to find GPE on the one hand, and what I would call "NAT_TERMS" on the other, i.e. general terms that describe natural locations and objects such as "mountain", "river", "path", "valley", "rock" and so on. I thought the latter would fall under the existing category "LOC", but that does not seem to be the case.

  2. What then is exactly the difference between GPE and LOC? Especially as cities and countries seem to be detected by spaCy as LOC rather than GPE, as I expected (talking about the German large model).

With these premises, and given that I have:

• a folder with many .txt files (my corpus)
• lists of NAT_TERMS and GPE words that I want the model to recognise (can be CSV or JSON)

  3. Is there an easy way to use the .txt files in a folder directly as a dataset, or do I have to convert them to a JSONL file with the desired dataset structure (potentially including metadata)?
    What if I wanted to apply spaCy's nlp to all my texts, split them into paragraphs, and then use those as my dataset?

Lastly, just to make sure I understood, is it correct that the whole process in my case could be simplified as follows?

  1. use a small portion of the corpus to teach the model new entities
  2. use a small portion of the corpus to teach the model sentiments
  3. train the model(s)
  4. apply it to the large corpus and explore the representation of sentiments in proximity to entities

Thanks!
Best
Grig.

Hi Grig,

Thanks for the detailed questions. I hope my replies can help.

Is it possible to train a model to recognise both entities and sentiments with prodi.gy?

You would train separate models for these, and then assemble them into a single pipeline. This definitely falls within the expected usage of Prodigy.
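
For example, once you've trained the two models, a rough sketch of assembling them into one pipeline could look like the following. This assumes spaCy v3-style pipeline sourcing; the model paths and the component name are placeholders for whatever you end up training:

```python
import spacy

# Hypothetical paths: one NER model (GPE / NAT_TERMS) and one text
# classifier for sentiment, each trained separately.
ner_nlp = spacy.load("./models/ner_locations")
sentiment_nlp = spacy.load("./models/sentiment_textcat")

# Source the sentiment component into the NER pipeline, so a single
# call to nlp(text) gives you both doc.ents and doc.cats. This assumes
# the two models were trained with compatible vocab/vectors.
nlp = ner_nlp
nlp.add_pipe("textcat", source=sentiment_nlp, last=True)

doc = nlp("Der Pfad führte über den Berg nach Innsbruck.")
print(doc.ents)   # entities from the NER model
print(doc.cats)   # sentiment scores from the text classifier
```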

What then is exactly the difference between GPE and LOC?

The German NER model unfortunately follows a different scheme from the English one at the moment. The German model is acquired semi-automatically from Wikipedia, and it uses the 4-class scheme (PER, LOC, ORG, MISC). The English model uses more types, which introduces the GPE vs LOC distinction.
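
If you want to check this yourself, you can inspect the entity labels each pipeline was trained with. A quick sketch; the exact package names depend on which models and spaCy version you have installed:

```python
import spacy

# Assumes the large German and English pipelines are installed.
nlp_de = spacy.load("de_core_news_lg")
nlp_en = spacy.load("en_core_web_lg")

print(nlp_de.get_pipe("ner").labels)  # the 4-class scheme: ('LOC', 'MISC', 'ORG', 'PER')
print(nlp_en.get_pipe("ner").labels)  # includes both GPE and LOC, among others
```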

Is there an easy way to use directly the txt files in a folder as dataset, or do I have to convert them to a jsonl with the desired dataset structure, (potentially including metadata)?

I would recommend converting to JSONL, because then you can look at the JSONL data and make sure everything is correct. Overall I think it'll be simpler. If you really want to stream from the folder, you could write a script that reads the data from the folder and prints out the JSONL-formatted lines. You'd then pipe the output of that script into the recipes.
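
As a rough sketch of that script (the folder path and the "source" meta key are just placeholders for your own setup):

```python
import json
from pathlib import Path

corpus_dir = Path("corpus")  # folder containing the .txt files

for path in sorted(corpus_dir.glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    # One JSONL record per file, keeping the filename as metadata.
    # If you'd rather work paragraph by paragraph, you could split
    # `text` here (e.g. on blank lines) and emit one record per paragraph.
    record = {"text": text, "meta": {"source": path.name}}
    print(json.dumps(record, ensure_ascii=False))
```

You could then either redirect the output to a file (e.g. `python txt_to_jsonl.py > corpus.jsonl`, where the script name is just an example) or pipe it straight into the recipes.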

Is it correct that the whole process in my case could be simplified as follows?

Yes, that's mostly correct, although I would note some addenda for completeness. First, there's no problem with using the same text to teach both the entities and the sentiments. Although these will be learned with different models, you can base the annotations on the same texts.

Second, remember that you'll want some extra data to evaluate the accuracy of the models you're training. So you'll want a portion of the corpus to teach the models entities and sentiments, and another small portion to evaluate the accuracy. Then you can run the pipeline on the corpus to do the exploration.
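
One simple way to hold out an evaluation portion is to shuffle your text- or paragraph-level records and set some fraction aside, along these lines (the split ratio and file names are arbitrary placeholders):

```python
import json
import random
from pathlib import Path

lines = Path("corpus.jsonl").read_text(encoding="utf-8").splitlines()
records = [json.loads(line) for line in lines if line.strip()]

random.seed(0)       # fixed seed so the split is reproducible
random.shuffle(records)

cut = int(len(records) * 0.8)   # e.g. 80% for annotation/training, 20% held out
train, heldout = records[:cut], records[cut:]

Path("train.jsonl").write_text(
    "\n".join(json.dumps(r, ensure_ascii=False) for r in train) + "\n", encoding="utf-8")
Path("eval.jsonl").write_text(
    "\n".join(json.dumps(r, ensure_ascii=False) for r in heldout) + "\n", encoding="utf-8")
```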

Depending on what analyses you're doing, it could be a good idea to compare how the inaccuracies introduced by the model change the conclusions, by running the analysis on both the manually annotated version of the evaluation data, and a version of the evaluation data annotated with the model.

Finally, when you're doing your evaluation, you probably want to pay attention to some aspects that aren't captured well by normal "accuracy" metrics. In particular, look at whether the error profile biases your results, rather than simply how accurate the model is overall. For instance, if there's a particular type of example the model always gets wrong, that's worse for you than if the error pattern were more uniform, since uniformly distributed errors won't affect your statistics as much.
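
As a sketch of that kind of check, you could count how often the model misses gold entities per label and see whether the misses cluster on one type. Here `gold_docs` is assumed to be your manually annotated evaluation data as (text, spans) pairs, and `nlp` your trained pipeline:

```python
from collections import Counter

def missed_per_label(nlp, gold_docs):
    """Count gold entity spans the model fails to predict, per label."""
    missed = Counter()
    for text, gold_spans in gold_docs:   # gold_spans: [(start_char, end_char, label), ...]
        predicted = {(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents}
        for span in gold_spans:
            if tuple(span) not in predicted:
                missed[span[2]] += 1
    return missed
```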

Many thanks! I’ll try 🙂