How to train custom NER with preannotations?

I want to create a custom NER annotation recipe with pre-annotated values, similar to ner.manual with added patterns. Which values should the stream return to allow this functionality?

Hi! You can find an example of the expected JSON format that Prodigy creates for NER here:

The most important parts are the "text" and the "spans", which describe the character offsets of the entities. So a pre-annotated example you create could look like this:

{
  "text": "First look at the new MacBook Pro",
  "spans": [
    {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
  ]
}

You can also export data in this format as JSONL and load it into ner.manual, and Prodigy will respect the existing annotations.
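
In case it helps, here's a minimal sketch of how you could produce such a JSONL file from plain Python. The helpers make_example and to_jsonl are hypothetical names for illustration, not part of Prodigy, and the sketch only finds the first occurrence of a phrase. It also leaves out "token_start" and "token_end" and only sets the character offsets.

```python
import json


def make_example(text, phrase, label):
    """Build one pre-annotated task for the first occurrence of `phrase` in `text`.

    Hypothetical helper for illustration, not a Prodigy function.
    """
    start = text.find(phrase)
    if start == -1:
        # Phrase not found: return the text without any pre-annotated spans
        return {"text": text, "spans": []}
    end = start + len(phrase)  # end offset is exclusive: text[start:end] == phrase
    return {"text": text, "spans": [{"start": start, "end": end, "label": label}]}


def to_jsonl(examples):
    """Serialize examples as JSONL, i.e. one JSON object per line."""
    return "\n".join(json.dumps(eg) for eg in examples)


example = make_example("First look at the new MacBook Pro", "MacBook Pro", "PRODUCT")
jsonl_text = to_jsonl([example])
```

A quick sanity check is that text[start:end] gives you back exactly the entity text for every span.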

If you're using a custom recipe, you can call Prodigy's add_tokens helper to automatically add the "tokens" and the spans' token indices, so you don't have to do this manually and can be sure that they match the model's tokenization. Your logic could look like this:

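from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

stream = JSONL(source)  # or however you load the raw data
stream = add_your_preannotations(stream)  # your own function that adds "spans" to each example
stream = add_tokens(nlp, stream)  # nlp is your loaded spaCy pipeline; adds "tokens" and token indices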
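
The add_your_preannotations step above is just a placeholder for your own logic. As a rough sketch, assuming your pre-annotations come from a simple term list, it could be a generator like the one below. The TERMS mapping is made up for illustration, the matching is a plain whole-word regex rather than Prodigy's pattern matching, and it doesn't check for overlapping matches.

```python
import re

# Hypothetical term-to-label mapping; replace with your own source of pre-annotations
TERMS = {"MacBook Pro": "PRODUCT", "Apple": "ORG"}


def add_your_preannotations(stream, terms=TERMS):
    """Yield each example with "spans" for every whole-word match of a known term."""
    for eg in stream:
        spans = []
        for term, label in terms.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, eg["text"]):
                # Character offsets only; token indices can be added in a later step
                spans.append({"start": match.start(), "end": match.end(), "label": label})
        # Sort by start offset so the spans appear in reading order
        spans.sort(key=lambda span: span["start"])
        yield {**eg, "spans": spans}


examples = list(add_your_preannotations([
    {"text": "First look at the new MacBook Pro"},
    {"text": "Apple updates the MacBook Pro"},
]))
```

Because it's a generator that takes a stream and yields examples, it slots in between loading the data and the add_tokens call.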

Thanks a lot! We finally solved this problem with a similar solution.