Tokenizer when training without base model

I have a question related to the train recipe. If we don't pass the --base-model argument, how is the tokenization done? I mean, how does the newly created model tokenize?

Prior to training, we ran db-in with annotations that have no tokens, only spans. This is the dataset we are passing to training via the --ner argument.

Thanks!
Álvaro

hi @alvaro.marlo!

After reading your post, I think I would rephrase your question as: "how do I bring annotations from an outside (non-Prodigy) source into Prodigy and account for the tokenization needed for training?" Is that correct? If so, you can skip the first part below, where I describe where tokenization fits in when annotations are created in Prodigy. It may still be helpful, though, as I assume you may want to add new annotations in the future.

When annotating in Prodigy, tokenization is done at annotation time. To run train, you need annotations in a Prodigy dataset that were already tokenized as part of the annotation process.

Let's take an example from the docs and assume you ran ner.manual and saved your data into a dataset we'll call ner_news_headlines:

prodigy ner.manual ner_news_headlines blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION

Notice the blank:en: that's the blank English model, i.e., it contains only the tokenizer.
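
If it helps to see this concretely, here's a quick check in spaCy (a minimal sketch, nothing Prodigy-specific):

import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)  # [] -> no trained components, only the tokenizer
print([t.text for t in nlp("Uber's Lesson")])  # ['Uber', "'s", 'Lesson']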

You can even see from the same example in the docs that after annotating, the exported data already has tokens:

prodigy db-out ner_news_headlines > ./annotations.jsonl
# annotations.jsonl
{
  "text": "Uber’s Lesson: Silicon Valley’s Start-Up Machine Needs Fixing",
  "meta": {
    "source": "The New York Times"
  },
  "_input_hash": 1886699658,
  "_task_hash": -1952856502,
  "tokens": [
    {
      "text": "Uber",
      "start": 0,
      "end": 4,
      "id": 0
    },
    {
      "text": "’s",
      "start": 4,
      "end": 6,
      "id": 1
    },
    {
      "text": "Lesson",
      "start": 7,
      "end": 13,
      "id": 2
    },
    {
      "text": ":",
      "start": 13,
      "end": 14,
      "id": 3
    },
    {
      "text": "Silicon",
      "start": 15,
      "end": 22,
      "id": 4
    },
    {
      "text": "Valley",
      "start": 23,
      "end": 29,
      "id": 5
    },
    {
      "text": "’s",
      "start": 29,
      "end": 31,
      "id": 6
    },
    {
      "text": "Start",
      "start": 32,
      "end": 37,
      "id": 7
    },
    {
      "text": "-",
      "start": 37,
      "end": 38,
      "id": 8
    },
    {
      "text": "Up",
      "start": 38,
      "end": 40,
      "id": 9
    },
    {
      "text": "Machine",
      "start": 41,
      "end": 48,
      "id": 10
    },
    {
      "text": "Needs",
      "start": 49,
      "end": 54,
      "id": 11
    },
    {
      "text": "Fixing",
      "start": 55,
      "end": 61,
      "id": 12
    }
  ],
  "_session_id": null,
  "_view_id": "ner_manual",
  "spans": [
    {
      "start": 0,
      "end": 4,
      "token_start": 0,
      "token_end": 0,
      "label": "ORG"
    },
    {
      "start": 15,
      "end": 29,
      "token_start": 4,
      "token_end": 5,
      "label": "LOCATION"
    }
  ],
  "answer": "accept"
}

Let's instead assume you're not annotating in Prodigy but bringing annotations in from somewhere else. The simplest approach would be to run add_tokens on your .jsonl before loading it with db-in, like:

from prodigy.components.preprocess import add_tokens
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [{"text": "Hello world"}, {"text": "Another text"}]
# adds a "tokens" key to each example; skip=True skips examples whose
# spans don't map to token boundaries instead of raising an error
stream = add_tokens(nlp, stream, skip=True)

You can then export your stream to a .jsonl (e.g., with srsly.write_jsonl("/path/to/file.jsonl", stream)) and load it with db-in.
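
Putting it together, a minimal end-to-end sketch might look like this (file and dataset names are hypothetical):

import spacy
import srsly
from prodigy.components.preprocess import add_tokens

nlp = spacy.load("en_core_web_sm")

# external annotations: one {"text": ..., "spans": [...]} object per line
stream = srsly.read_jsonl("./external_annotations.jsonl")
stream = add_tokens(nlp, stream, skip=True)
srsly.write_jsonl("./tokenized_annotations.jsonl", stream)

# then load into a dataset: prodigy db-in my_dataset ./tokenized_annotations.jsonl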

Now here's what's very important (and sometimes overlooked): what tokenizer did you use to get your spans? If you used something other than spaCy, you will likely get mismatched tokenization, which can become a headache.

Here's a post that describes this problem and includes a way to detect mismatched examples:

You may find that only a small number of mismatches occur -- but even one will likely trigger an error message. There are many other related issues (40+ that mention "token mismatch"), as well as an open source library for tokenization mismatch and alignment that can help.
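
In the meantime, here's a minimal sketch of one way to detect mismatches yourself, using the fact that spaCy's doc.char_span returns None when character offsets don't line up with token boundaries (the file name and the blank English pipeline are assumptions):

import spacy
import srsly

nlp = spacy.blank("en")  # use the same tokenizer you plan to train with

for eg in srsly.read_jsonl("./annotations.jsonl"):
    doc = nlp(eg["text"])
    for span in eg.get("spans", []):
        # char_span returns None if (start, end) don't align to token boundaries
        if doc.char_span(span["start"], span["end"]) is None:
            print("Mismatch:", repr(eg["text"][span["start"]:span["end"]]), span)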

Ok! It's all clear!

But one additional question regarding this:

I have annotations without tokenization (I created them using FlashText); here is an example:

{
    "text": "El Juzgado de Primera Instancia e Instrucción número tres de Huesca, en cumplimiento de lo dispuesto en el artículo 23 de la Ley Concursal, anuncia:\nPrimero.- Que en el procedimiento número 726/2007, por auto de 23 de enero de 2008 se ha declarado en concurso voluntario al deudor Tambir Al Isalma Zebunesa, con domicilio en el El Temple y cuyo centro de principales intereses lo tiene en el El Temple.\nSegundo.- Que el deudor conserva las facultades de administración y de disposición de su patrimonio, pero sometidas éstas a la intervención de la administración concursal.\nTercero.- Que los acreedores del concursado deben poner en conocimiento de la administración concursal la existencia de sus créditos en la forma y con los datos expresados en el artículo 85 de la Ley Concursal.\nEl plazo para esta comunicación es el de un mes a contar de la última publicación de los anuncios que se ha ordenado publicar en el Boletín Oficial del Estado y en el /los periódicos Heraldo de Aragón.\nCuarto.- Que los acreedores e interesados que deseen comparecer en el procedimiento deberán hacerlo por medio de Procurador y asistidos de Letrado, artículo 184.3 Ley Concursal.\nHuesca, 20 de octubre de 2009.- El/La Secretario Judicial de Juzgado de Primera Instancia e Instrucción número 3.",
    "spans":
    [
        {
            "text": "Juzgado de Primera Instancia e Instrucción número tres",
            "start": 3,
            "end": 57,
            "label": "COURT"
        },
        {
            "text": "Huesca",
            "start": 61,
            "end": 67,
            "label": "GPE"
        },
        {
            "text": "23 de enero de 2008",
            "start": 212,
            "end": 231,
            "label": "DATE"
        },
        {
            "text": "Procurador",
            "start": 1101,
            "end": 1111,
            "label": "ROLE"
        },
        {
            "text": "Letrado",
            "start": 1127,
            "end": 1134,
            "label": "ROLE"
        },
        {
            "text": "Huesca",
            "start": 1166,
            "end": 1172,
            "label": "GPE"
        },
        {
            "text": "20 de octubre de 2009",
            "start": 1174,
            "end": 1195,
            "label": "DATE"
        },
        {
            "text": "Secretario Judicial",
            "start": 1204,
            "end": 1223,
            "label": "ROLE"
        },
        {
            "text": "Juzgado de Primera Instancia e Instrucción número 3",
            "start": 1227,
            "end": 1278,
            "label": "COURT"
        }
    ]
}

I can use the add_tokens function with the use_chars argument set to True to avoid the mismatched tokenization, right?

Then, when I do the training, will the tokenization of the newly created model be based on what the add_tokens function produces?

Massive thanks!

Yes - I think that may work. I haven't used use_chars myself, but I believe it switches to character-based tokenization.
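
For reference, a minimal sketch of that call might look like this (the blank Spanish pipeline is my assumption, based on your example text):

import spacy
from prodigy.components.preprocess import add_tokens

nlp = spacy.blank("es")  # assumption: blank Spanish pipeline for your legal text
stream = [{"text": "Huesca, 20 de octubre de 2009",
           "spans": [{"start": 0, "end": 6, "label": "GPE"}]}]
# as I understand it, use_chars=True creates one "token" per character,
# so character-offset spans will always align
stream = add_tokens(nlp, stream, use_chars=True)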

However, this comment from the docs suggests that while this could work, I can't guarantee it will work for 100% of spans:

When using character-based highlighting, annotation may be slower and there’s no guarantee that the spans you annotate map to actual tokens later on. If your goal is to train a named entity recognizer, you should consider using the same tokenizer during annotation, to make sure that your data can be used.

If you train a model, it will have spaCy's base tokenizer in the pipeline, so it should be consistent with how the add_tokens function tokenized your data.
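
If you want to double-check that after training, you can load the trained pipeline and inspect its tokenizer (the path is hypothetical, pointing at a prodigy train output directory):

import spacy

nlp = spacy.load("./output/model-best")
print(type(nlp.tokenizer))  # spaCy's default rule-based Tokenizer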

Also, unrelated to tokenization: since you're considering prodigy train and asking how to know what tokenizer your model has, you may want to invest a little time in learning data-to-spacy and using spacy train instead of prodigy train (which is just a wrapper for spacy train). Be sure to read the data-to-spacy docs for details.
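
The workflow would look roughly like this (dataset and directory names are hypothetical; in recent Prodigy versions, data-to-spacy also generates a config.cfg in the output directory):

prodigy data-to-spacy ./corpus --ner your_dataset
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --output ./output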

This will force you to handle your own config file (see this blog).

By doing so, you can be explicit about which tokenizer you're using, e.g.:

[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

prodigy train exists as an easy way to start training in Prodigy, but it uses a default config file that may not be apparent. If you're serious about experiments, I would move immediately to learning spacy train and handling config files, because over time this lets you build much more customized architectures and do hyperparameter tuning.

Also, be sure to check out our spaCy projects repo, which has a lot of projects whose config files you can learn from.