Tokenizer when training without base model

hi @alvaro.marlo!

After reading your post, I think I would rephrase your question as: "How do I bring annotations from an outside (non-Prodigy) source into Prodigy and account for the tokenization needed for training?" Is that correct? If so, you can skip the first part below, which describes where tokenization fits in when annotations are created in Prodigy. It may still be helpful, as I assume you may want to add new annotations in the future.

When you annotate in Prodigy, tokenization happens during the annotation process. To run train, you need annotations in a Prodigy dataset that were already tokenized as part of that process.

Let's take an example from the docs and assume you ran ner.manual and saved your data to a dataset we'll call ner_news_headlines:

prodigy ner.manual ner_news_headlines blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION

Notice the blank:en, which is the blank English model: it contains only the tokenizer.

The same example in the docs shows that, after annotating, the exported annotations already include tokens:

prodigy db-out ner_news_headlines > ./annotations.jsonl
# annotations.jsonl
{
  "text": "Uber’s Lesson: Silicon Valley’s Start-Up Machine Needs Fixing",
  "meta": {
    "source": "The New York Times"
  },
  "_input_hash": 1886699658,
  "_task_hash": -1952856502,
  "tokens": [
    {
      "text": "Uber",
      "start": 0,
      "end": 4,
      "id": 0
    },
    {
      "text": "’s",
      "start": 4,
      "end": 6,
      "id": 1
    },
    {
      "text": "Lesson",
      "start": 7,
      "end": 13,
      "id": 2
    },
    {
      "text": ":",
      "start": 13,
      "end": 14,
      "id": 3
    },
    {
      "text": "Silicon",
      "start": 15,
      "end": 22,
      "id": 4
    },
    {
      "text": "Valley",
      "start": 23,
      "end": 29,
      "id": 5
    },
    {
      "text": "’s",
      "start": 29,
      "end": 31,
      "id": 6
    },
    {
      "text": "Start",
      "start": 32,
      "end": 37,
      "id": 7
    },
    {
      "text": "-",
      "start": 37,
      "end": 38,
      "id": 8
    },
    {
      "text": "Up",
      "start": 38,
      "end": 40,
      "id": 9
    },
    {
      "text": "Machine",
      "start": 41,
      "end": 48,
      "id": 10
    },
    {
      "text": "Needs",
      "start": 49,
      "end": 54,
      "id": 11
    },
    {
      "text": "Fixing",
      "start": 55,
      "end": 61,
      "id": 12
    }
  ],
  "_session_id": null,
  "_view_id": "ner_manual",
  "spans": [
    {
      "start": 0,
      "end": 4,
      "token_start": 0,
      "token_end": 0,
      "label": "ORG"
    },
    {
      "start": 15,
      "end": 29,
      "token_start": 4,
      "token_end": 5,
      "label": "LOCATION"
    }
  ],
  "answer": "accept"
}
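To make the relationship between "tokens" and "spans" concrete, here is a minimal sketch in plain Python (no Prodigy needed). It uses a shortened copy of the task above, keeping only the first seven tokens, and the helper name span_token_texts is made up for illustration. Note that token_end is inclusive, while the character offset end is exclusive.

```python
# Shortened copy of the exported task above (first seven tokens only).
task = {
    "text": "Uber’s Lesson: Silicon Valley’s",
    "tokens": [
        {"text": "Uber", "start": 0, "end": 4, "id": 0},
        {"text": "’s", "start": 4, "end": 6, "id": 1},
        {"text": "Lesson", "start": 7, "end": 13, "id": 2},
        {"text": ":", "start": 13, "end": 14, "id": 3},
        {"text": "Silicon", "start": 15, "end": 22, "id": 4},
        {"text": "Valley", "start": 23, "end": 29, "id": 5},
        {"text": "’s", "start": 29, "end": 31, "id": 6},
    ],
    "spans": [
        {"start": 0, "end": 4, "token_start": 0, "token_end": 0, "label": "ORG"},
        {"start": 15, "end": 29, "token_start": 4, "token_end": 5, "label": "LOCATION"},
    ],
}


def span_token_texts(task, span):
    """Return the texts of the tokens a span covers (token_end is inclusive)."""
    return [
        tok["text"]
        for tok in task["tokens"]
        if span["token_start"] <= tok["id"] <= span["token_end"]
    ]


for span in task["spans"]:
    # Character offsets slice the raw text; token indexes pick whole tokens.
    surface = task["text"][span["start"]:span["end"]]
    print(span["label"], repr(surface), span_token_texts(task, span))
```

Each span starts exactly where its first token starts and ends exactly where its last token ends, which is the alignment that training relies on.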

Now let's assume you're not annotating in Prodigy but bringing in annotations from somewhere else. The simplest way is to run add_tokens over your .json data before loading it with db-in, like this:

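import spacy
from prodigy.components.preprocess import add_tokens

# Load the pipeline whose tokenizer will be applied to the texts
nlp = spacy.load("en_core_web_sm")
stream = [{"text": "Hello world"}, {"text": "Another text"}]
# Adds a "tokens" key to each example; skip=True skips examples whose
# spans don't align with the tokenization instead of raising an error
stream = add_tokens(nlp, stream, skip=True)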

You can then use this script to export your stream to a .jsonl file (e.g., with srsly.write_jsonl("/path/to/file.jsonl", stream)).
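If it helps to see what that export step amounts to, here is a rough stand-in written with only the standard library. It is a sketch, not srsly's implementation: the write_jsonl function below is a hypothetical helper, the stream is a plain generator standing in for the output of add_tokens, and the output goes to a temporary directory rather than a real path.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, stream):
    """Write an iterable of dicts as one JSON object per line; return the count."""
    count = 0
    with Path(path).open("w", encoding="utf8") as f:
        for task in stream:
            # ensure_ascii=False keeps characters like ’ readable in the file
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
            count += 1
    return count


# Stand-in for the generator of tokenized tasks (assumption for this sketch)
stream = ({"text": text} for text in ["Hello world", "Another text"])

out_path = Path(tempfile.mkdtemp()) / "file.jsonl"
written = write_jsonl(out_path, stream)
lines = out_path.read_text(encoding="utf8").splitlines()
print(written, "tasks written to", out_path)
```

The resulting file can then be loaded with db-in, for example prodigy db-in your_dataset ./file.jsonl, where your_dataset is a placeholder name.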

Now, what's very important (and sometimes overlooked): what tokenizer did you use to get your spans? If you used something other than spaCy, you will likely get mismatched tokenization, which can become a headache.

Here's a post that describes the problem and includes a way to detect mismatched examples:

You may find that only a small number of mismatches occur, but even a single one will likely cause an error message. There are many other related issues (40+ that mention "token mismatch"), as well as an open-source library for tokenization mismatch and alignment that can help.
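As a rough idea of what such a check can look like, here is a minimal sketch in plain Python. It is not the code from the linked post: find_misaligned_spans is a hypothetical helper, and the tokens in the example are written by hand for illustration rather than produced by spaCy. It flags any span whose character offsets don't land on token boundaries.

```python
def find_misaligned_spans(task):
    """Return the spans whose character offsets do not sit on token boundaries."""
    token_starts = {tok["start"] for tok in task["tokens"]}
    token_ends = {tok["end"] for tok in task["tokens"]}
    return [
        span
        for span in task.get("spans", [])
        if span["start"] not in token_starts or span["end"] not in token_ends
    ]


# Hand-written examples (assumptions for this sketch, not real tokenizer output).
tasks = [
    {
        # The tokens keep "Uber’s" together, but the span covers only "Uber",
        # so the span ends in the middle of a token.
        "text": "Uber’s Lesson",
        "tokens": [
            {"text": "Uber’s", "start": 0, "end": 6, "id": 0},
            {"text": "Lesson", "start": 7, "end": 13, "id": 1},
        ],
        "spans": [{"start": 0, "end": 4, "label": "ORG"}],
    },
    {
        # Here the span starts and ends exactly on token boundaries.
        "text": "Silicon Valley",
        "tokens": [
            {"text": "Silicon", "start": 0, "end": 7, "id": 0},
            {"text": "Valley", "start": 8, "end": 14, "id": 1},
        ],
        "spans": [{"start": 0, "end": 14, "label": "LOCATION"}],
    },
]

mismatched = [
    (task["text"], span) for task in tasks for span in find_misaligned_spans(task)
]
for text, span in mismatched:
    print("Misaligned:", repr(text[span["start"]:span["end"]]), "in", repr(text))
```

Running a check like this before training lets you fix or drop the handful of problem examples up front instead of hitting an error partway through.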