F1-score doesn't improve for larger annotation sets

Hi there!

Just to make sure we're talking about the same thing: are you annotating companies and persons (so two entity types), and do you have 5 to 1000 unique examples of each?

Is there a reason why you're stopping early?

What are you judging this on? Do you have a fixed validation set, or does the validation set also change as you add more annotations?
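To illustrate what I mean by a fixed validation set, here's a minimal sketch in plain Python. The helper names (`fixed_split`, `growing_train_sets`), the 20% hold-out and the placeholder data are all assumptions on my part, not something from your setup. The idea is just that the evaluation examples are picked once, and only the training portion grows.

```python
import random


def fixed_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and hold out a validation set that never changes."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    # Returns (train, dev); dev is the held-out part.
    return shuffled[n_eval:], shuffled[:n_eval]


def growing_train_sets(train, sizes):
    """Yield nested training subsets; each one should be scored on the same dev set."""
    for size in sizes:
        yield train[: min(size, len(train))]


# Placeholder data (hypothetical); swap in your own annotated examples.
examples = [f"example {i}" for i in range(1000)]
train, dev = fixed_split(examples)
subsets = list(growing_train_sets(train, [5, 50, 500, 1000]))
```

If the scores are computed on a validation set that grows along with the training data, the numbers for different sizes aren't really comparable.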

It's hard to say for sure, but it could be that by increasing the number of annotations you're also increasing the diversity of the ML task. Maybe the first few entities are much easier to detect? Do you have examples of situations where the model gets it correct and where it gets it wrong?
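One way to check whether some entities are easier than others is to break the score down per label instead of looking at a single F1 number. Below is a rough sketch that assumes gold and predicted entities are available as `(start, end, label)` tuples per document. That format, the `COMPANY`/`PERSON` label names and the `per_label_scores` helper are my assumptions, and it only counts exact span matches.

```python
from collections import Counter


def per_label_scores(gold_docs, pred_docs):
    """Exact-match precision, recall and F1 per entity label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        for span in pred & gold:  # predicted and correct
            tp[span[2]] += 1
        for span in pred - gold:  # predicted but not in gold
            fp[span[2]] += 1
        for span in gold - pred:  # in gold but missed
            fn[span[2]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores


# Toy data (made up) just to show the expected input shape.
gold_docs = [[(0, 5, "COMPANY"), (10, 15, "PERSON")], [(3, 8, "PERSON")]]
pred_docs = [[(0, 5, "COMPANY")], [(3, 8, "PERSON"), (20, 25, "COMPANY")]]
scores = per_label_scores(gold_docs, pred_docs)
```

Comparing these per-label numbers across training sizes should show whether one entity type is dragging the overall score down.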

It's a phenomenon that I've stumbled upon a few times. This PyData talk gives one such example, related to detecting programming languages in text.

A final thing that comes to mind: did you annotate this data yourself, or with a group? Could it be that there are label errors, or annotators who disagree?
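If more than one person annotated the same texts, a quick agreement check can tell you whether disagreement is a likely culprit. Here's a small sketch of Cohen's kappa over token-level labels from two annotators. The token-level format, the label names and the toy data are assumptions for illustration, so adapt it to however your annotations are stored.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators need the same, non-zero number of labels.")
    n = len(labels_a)
    # Fraction of items where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, given each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy token-level labels (made up); "O" means the token is not an entity.
annotator_a = ["O", "PERSON", "O", "COMPANY", "O", "O"]
annotator_b = ["O", "PERSON", "O", "O", "O", "COMPANY"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

A kappa close to 1 means strong agreement, while a value near 0 means the annotators agree about as often as chance would predict, which would make it hard for any model to score well.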

Let me know!