NER on long texts

I would like to train a NER model on scientific texts. The goal is to label different physical properties. The texts are very long and mention the relevant properties only in certain places. Is it advisable to split the texts? If I split them, only very few parts would contain a corresponding entity. How many words/sentences are optimal for Prodigy?

Hi @yllwpr,
It is definitely advisable to split texts for NER annotation. In fact, the ner.correct and ner.teach recipes split texts into sentences by default. There are three main reasons for this:

1. NER models usually learn and infer based on a fairly narrow window of tokens.
2. Prodigy workflows with a model in the loop are more efficient if the tasks are smaller.
3. Smaller text chunks make for a less taxing annotation task.
That said, it is possible to modify the UI to accommodate longer chunks of text, as explained here.
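To see the default behavior in action, here is a minimal sketch (the dataset name, source file, and PROPERTY label are placeholders; check your Prodigy version's docs for the exact flags):

```bash
# ner.correct splits the incoming texts into sentences by default
prodigy ner.correct my_ner_data en_core_web_sm ./papers.jsonl --label PROPERTY

# pass --unsegmented to keep each input text as one (long) annotation task instead
prodigy ner.correct my_ner_data en_core_web_sm ./papers.jsonl --label PROPERTY --unsegmented
```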

Given that you expect your target entities to be sparse, the recommended workflow (sketched in the commands after this list) would probably be to:

  1. Annotate a small gold-standard corpus, either fully manually or with the help of patterns, using the ner.manual recipe with the --patterns option.
  2. Train an initial model on this gold-standard corpus using the train recipe.
  3. Use the model trained in step 2 in the ner.teach recipe to suggest the most relevant examples to annotate in the rest of your original corpus.
  4. Train the final model on the full training set.
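Concretely, the four steps could look roughly like this. Dataset names, file paths, and the PROPERTY label are placeholders, and the exact flags can differ between Prodigy versions, so treat it as a sketch rather than copy-paste commands:

```bash
# 1) Seed a gold-standard set, optionally boosted by match patterns.
#    patterns.jsonl holds spaCy match patterns, one per line, e.g.:
#    {"label": "PROPERTY", "pattern": [{"LOWER": "melting"}, {"LOWER": "point"}]}
prodigy ner.manual ner_gold en_core_web_sm ./papers.jsonl --label PROPERTY --patterns ./patterns.jsonl

# 2) Train an initial model from the gold-standard annotations.
prodigy train ./initial_model --ner ner_gold

# 3) Put that model in the loop so Prodigy suggests the most uncertain
#    candidates from the rest of the corpus.
prodigy ner.teach ner_teach ./initial_model/model-best ./papers.jsonl --label PROPERTY

# 4) Train the final model on everything collected so far.
prodigy train ./final_model --ner ner_gold,ner_teach
```

The train recipe writes model-best and model-last to the output directory, which is why step 3 points at ./initial_model/model-best.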