Recommended text length for training NER models in spaCy

Hello @koaning ,

I have been collaborating with @agustinadinamarca for a couple of weeks now, and regarding the question she originally posted, I would like to add the following queries:

  1. Other than "being easier to annotate", is there any other advantage to using shorter texts over longer ones?
    1.1. P.S.: I know the rule of thumb here is "try it and see", but in your own experience, have you noticed any trend?
    1.2. In my own experience, I have trained some spaCy NER models, but only on rather short texts (<=25 words). Those models tended to be "highly accurate on the first trials, but prone to catastrophic forgetting"; nothing like what we are dealing with right now, though. You might be thinking "well, just trim the texts", but point 2 below explains why that is not straightforward. I appreciate your patience :smile:.
    1.3. Prodigy provides some tools to make experimentation easier, for instance `train-curve` (see the command sketch right after this list). Using it, we have preliminarily concluded that, in our case and with the current text length (~425 words), we seem to need WAY more samples (find here a quick, actual test result). As for how many more samples we need: we will annotate ~4500 more texts and check whether the model metrics do improve afterwards (we are a small team, so even if that batch seems small, it still involves a considerable time budget for us :wink:); we know, however, that many more samples could still be needed.
  2. We are also hesitant to reduce the "text word count" because, in our use case, the labeled entities are rather scarce on average (for reference, I'd say there are <=10 entities in those ~425 words mentioned above). By splitting the texts into shorter chunks, we would expose ourselves to lots of rejected examples during labeling, since many chunks would contain no entity at all (the Python sketch below illustrates the kind of check we have in mind).
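
For context, this is roughly how we ran the `train-curve` experiment. `ner_dataset` is a placeholder for our actual Prodigy dataset name, and the flags follow the Prodigy v1.11 CLI (they may differ in other versions):

```
# Train on 25%, 50%, 75% and 100% of the annotations and report accuracy
# at each step; --show-plot renders the curve directly in the terminal.
prodigy train-curve --ner ner_dataset --n-samples 4 --show-plot
```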
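
And to make point 2 concrete, here is a minimal sketch of the check we would run before committing to shorter texts. It assumes Prodigy-style annotations (a text plus character-offset entity spans) and uses spaCy's rule-based sentencizer to split texts into chunks; the function name `chunk_entity_counts` and the toy data are just for illustration:

```python
import spacy

# Blank pipeline with only a rule-based sentencizer: we just need
# sentence boundaries here, not a full trained model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def chunk_entity_counts(text, spans):
    """Split `text` into sentence chunks and count how many annotated
    entity spans (dicts with "start"/"end" character offsets, as in
    Prodigy's JSONL format) fall entirely inside each chunk."""
    doc = nlp(text)
    return [
        sum(
            1
            for span in spans
            if span["start"] >= sent.start_char and span["end"] <= sent.end_char
        )
        for sent in doc.sents
    ]

# Toy example: a short text with sparse entities.
text = "Alice moved to Paris. The weather was mild. Nothing else happened."
spans = [
    {"start": 0, "end": 5},    # "Alice"
    {"start": 15, "end": 20},  # "Paris"
]

counts = chunk_entity_counts(text, spans)
empty = sum(1 for c in counts if c == 0)
print(f"{empty} of {len(counts)} chunks contain no entity")
# With entity density like ours, a large share of chunks would come out
# empty, which is exactly the rejection problem described above.
```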

I hope that, with these clarifications, you can help us with some further recommendations; they will be very much appreciated.

Thank you very much!