F1-score doesn't improve for larger annotation sets

We trained an NER model on top of spaCy's English large model (en_core_web_lg), using Prodigy annotation sets ranging from 5 to 1000 annotated entities (organizations and persons from the Enron dataset). The learning rate is 0.003 (which we found to be the optimal value), the maximum number of steps is 110, and the maximum number of epochs is 10.

However, we found that using a larger annotation set did not result in better performance.

The plot shows the F1-score for 5 to 1000 annotated entities. The blue bars depict the F1-score when the label is taken into account; the orange bars show the F1-score without considering the label.

This raises the question of whether this is normal or whether there might be other factors at play. Initially, we expected that including more entities would lead to an improvement in the model's performance. However, this was not the case. Does anyone have an idea why this is happening?

Hi there!

Just to make sure we're talking about the same thing: are you annotating companies and persons (two entity types?), and do you have 5 to 1000 unique examples of each?

Is there a reason why you're stopping early?

What are you judging this on? Do you have a fixed validation set, or does the validation set also change as you increase the number of annotations?

It's hard to say for sure, but it could be that by increasing the number of annotations you're also increasing the diversity of the ML task. Maybe the first few entities are much easier to detect? Do you have examples of situations where the model gets it correct and where it gets it wrong?

It's a phenomenon that I've stumbled upon a few times. This PyData talk gives one such example related to detecting programming languages in text.

A final thing that comes to mind, have you annotated this data yourself manually or with a group? Could it be that there are label errors or annotators that disagree?

Let me know!

First and foremost, I want to express my sincere gratitude for your response and especially for the video you provided. It was incredibly helpful, and we truly appreciate your effort in helping us.

To provide more clarity on our project, we have been using the en_core_web_lg model to train a system that identifies persons and organizations. Our goal is to optimize the model for our use case by adapting it to our domain. Unfortunately, our data consists of emails (including headers) from the Enron Dataset, as client information is too sensitive. As a result, a significant portion of our training and evaluation data is not comprised of full sentences, and the majority of it is located in the header section. We expect the model to perform poorly on the headers, but we've also discovered that the model has difficulty identifying persons in some of the easy sentences located in the email bodies.

Our dataset contains up to 1000 entities, with each category making up roughly 50%; for instance, out of 1000 entities, 505 are persons and 499 are organizations. It's worth noting that each smaller annotation set is a subset of the larger one: if we have 600 entities, all of them are also included in the 1000-entity set, which adds 400 new ones. Each entity is unique, but it can appear more than once.

We trained our models on these entities and then tested them on a dedicated validation set that we never used for training. The validation set consists of 200 entities, with the categories also split approximately 50/50. We implemented our own validation logic, where a predicted entity only counts as correct if its string is identical to the gold annotation and it is annotated at the same location with the same elements.
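Roughly, the matching works like the following simplified sketch (representing entities as (text, start, end, label) tuples here is an assumption for illustration, not our exact data format):

```python
# Simplified sketch of exact-match entity scoring: a prediction counts only if
# text and character offsets match the gold annotation exactly; the label is
# optionally included, mirroring the "with label" / "without label" scores.
def score_exact(gold, pred, use_label=True):
    def key(ent):  # ent = (text, start, end, label)
        return ent[:4] if use_label else ent[:3]

    gold_set = {key(e) for e in gold}
    pred_set = {key(e) for e in pred}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


gold = [("Jan Butler", 23, 33, "PERSON"), ("Enron", 40, 45, "ORG")]
pred = [("Enron", 40, 45, "ORG")]
print(score_exact(gold, pred))  # (1.0, 0.5, 0.666...)
```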

Please refer to the attached pictures. The first image shows our annotated gold standard, and the second shows the model's predictions after training on top of the existing person/org tags of en_core_web_lg with 1000 entities. Grey indicates true positives, while red indicates false negatives. As you can see, the model did not identify the name Jan Butler.

[Attached images: annotated gold standard; model predictions after training]

We stopped training early, with a max_step_size of 110, because we initially tried multiple combinations of training hyperparameters on 600 entities, and the max_step_size of 110 and learning rate of 0.003 yielded the best results. This does not necessarily mean that these parameters are the best for fewer or more entities, but we did not want to change too many settings when comparing the model progress.

Happy to hear it :smile:

Unfortunately, our data consists of emails (including headers) from the Enron Dataset, as client information is too sensitive.

One thing to keep in mind then: is the Enron dataset really representative of the emails that you'd like to predict with your model? The way people would write emails back then is different from now, and I could even imagine that "internet English" has evolved too. There can still be good reasons not to train on data that's too sensitive, but there is also a risk of using data from a distribution that's unlike the application domain.

identifying persons in some of the easy sentences located in the email bodies.

Detecting a person's name tends to be a very hard problem. Back when I worked at Rasa I made a small tutorial video that expands on some of the hard things you may encounter. I'll share it below because it might be relevant to your problem. In general, spaCy models are sensitive to capitalization, which could cause issues.
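To see this for yourself, here's a quick sketch (assuming en_core_web_lg is installed; the exact predictions will vary with the model version) that compares what the pretrained pipeline picks up with and without capitalisation:

```python
# Quick demo of casing sensitivity in a pretrained spaCy pipeline: the same
# sentence is run through the NER component in its original and lowercased form.
import spacy

nlp = spacy.load("en_core_web_lg")

for text in ["Jan Butler will join the call on Friday.",
             "jan butler will join the call on friday."]:
    doc = nlp(text)
    print(text, "->", [(ent.text, ent.label_) for ent in doc.ents])
```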

If you're working with emails though ... you might be able to leverage the email address itself. After all, most of the time some part of a person's name will be part of the email address. So it could be helpful to see if you can do something clever there.
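Something along these lines, for example (just a sketch that assumes the common first.last@company.com pattern, not a robust parser):

```python
# Rough heuristic: split the local part of an email address on common
# separators and treat the alphabetic pieces as candidate name tokens.
import re

def name_candidates(address: str) -> list[str]:
    local = address.split("@", 1)[0]
    parts = re.split(r"[._\-+]", local)
    return [p.capitalize() for p in parts if p.isalpha()]

print(name_candidates("jan.butler@enron.com"))   # ['Jan', 'Butler']
print(name_candidates("jbutler123@enron.com"))   # [] (digits, so no clean split)
```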

As you can see, the model did not identify the name Jan Butler.

In that example, is it possible that the names David and Julia do appear in the training data while Jan does not? Part of me cannot help but observe that Jan isn't a very English-sounding name ... I believe it has a Dutch origin. That might be part of what you're observing here.

training hyperparameters on 600 entities, and the max_step_size of 110 and learning rate of 0.003 yielded the best results

I see. Part of me worries that at this point of the project it's better to spend time on data quality than on hyperparameters. Especially if there's some uncertainty on how applicable the Enron dataset is, I might instead worry about that first.

Just an idea: have you tried using name lists as a starting point? I can imagine a simple rule like "if a token from a name list appears and it is capitalised -> assume it is a name" and "if a token from a name list appears and it is capitalised, as is the following token -> assume first/last name".
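For example, such a rule could look like this (a minimal sketch; NAME_LIST is a placeholder for whatever name list you end up using):

```python
# Minimal sketch of the name-list + capitalisation rule: if a capitalised token
# appears in the name list, flag it; if the next token is capitalised too,
# assume it is a first/last name pair.
import spacy

NAME_LIST = {"jan", "david", "julia"}  # placeholder; plug in a real name list

nlp = spacy.blank("en")

def guess_name_spans(text):
    doc = nlp(text)
    spans = []
    for i, token in enumerate(doc):
        if token.text.lower() in NAME_LIST and token.text[:1].isupper():
            nxt = doc[i + 1] if i + 1 < len(doc) else None
            if nxt is not None and nxt.is_alpha and nxt.text[:1].isupper():
                spans.append(doc[i : i + 2])  # likely first + last name
            else:
                spans.append(doc[i : i + 1])  # likely a single first name
    return spans

print([span.text for span in guess_name_spans("Please forward this to Jan Butler.")])
# ['Jan Butler']
```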

If you need a source of name lists, you might enjoy these:

Let me know if this helps!