Tokenizer when training without base model

ryanwesslen · December 14, 2022, 3:58pm

After reading your post, I think your question could be rephrased as: "How do I bring annotations from an outside (non-Prodigy) source into Prodigy and account for the tokenization needed for training?" Is that correct? If so, you can skip the first part, where I describe where tokenization fits in when annotations are created in Prodigy. It may still be helpful, as I assume you may want to add new annotations in the future.

When you annotate in Prodigy, tokenization happens at annotation time. To run train, your annotations need to be in a Prodigy dataset and already tokenized as part of the annotation process.

Let's take an example from the docs and assume you ran ner.manual and saved your data into a dataset we'll call ner_news_headlines:

prodigy ner.manual ner_news_headlines blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION

Notice the blank:en, which is the blank English model; that is, it's only the tokenizer.

In the same example from the docs, you can export the data after annotating and see that the annotations already have tokens:

prodigy db-out ner_news_headlines > ./annotations.jsonl

# annotations.jsonl
{
  "text": "Uber’s Lesson: Silicon Valley’s Start-Up Machine Needs Fixing",
  "meta": {
    "source": "The New York Times"
  },
  "_input_hash": 1886699658,
  "_task_hash": -1952856502,
  "tokens": [
    {
      "text": "Uber",
      "start": 0,
      "end": 4,
      "id": 0
    },
    {
      "text": "’s",
      "start": 4,
      "end": 6,
      "id": 1
    },
    {
      "text": "Lesson",
      "start": 7,
      "end": 13,
      "id": 2
    },
    {
      "text": ":",
      "start": 13,
      "end": 14,
      "id": 3
    },
    {
      "text": "Silicon",
      "start": 15,
      "end": 22,
      "id": 4
    },
    {
      "text": "Valley",
      "start": 23,
      "end": 29,
      "id": 5
    },
    {
      "text": "’s",
      "start": 29,
      "end": 31,
      "id": 6
    },
    {
      "text": "Start",
      "start": 32,
      "end": 37,
      "id": 7
    },
    {
      "text": "-",
      "start": 37,
      "end": 38,
      "id": 8
    },
    {
      "text": "Up",
      "start": 38,
      "end": 40,
      "id": 9
    },
    {
      "text": "Machine",
      "start": 41,
      "end": 48,
      "id": 10
    },
    {
      "text": "Needs",
      "start": 49,
      "end": 54,
      "id": 11
    },
    {
      "text": "Fixing",
      "start": 55,
      "end": 61,
      "id": 12
    }
  ],
  "_session_id": null,
  "_view_id": "ner_manual",
  "spans": [
    {
      "start": 0,
      "end": 4,
      "token_start": 0,
      "token_end": 0,
      "label": "ORG"
    },
    {
      "start": 15,
      "end": 29,
      "token_start": 4,
      "token_end": 5,
      "label": "LOCATION"
    }
  ],
  "answer": "accept"
}
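To make the relationship between the character offsets and the token indices concrete, here is a minimal sketch in plain Python. It uses a trimmed copy of the record above (first six tokens only), and the helper name span_text_from_tokens is my own, not part of Prodigy. The point to notice is that token_end is inclusive, while the character end offset is exclusive.

```python
# Trimmed copy of the exported record above (first six tokens only).
example = {
    "text": "Uber’s Lesson: Silicon Valley’s Start-Up Machine Needs Fixing",
    "tokens": [
        {"text": "Uber", "start": 0, "end": 4, "id": 0},
        {"text": "’s", "start": 4, "end": 6, "id": 1},
        {"text": "Lesson", "start": 7, "end": 13, "id": 2},
        {"text": ":", "start": 13, "end": 14, "id": 3},
        {"text": "Silicon", "start": 15, "end": 22, "id": 4},
        {"text": "Valley", "start": 23, "end": 29, "id": 5},
    ],
    "spans": [
        {"start": 0, "end": 4, "token_start": 0, "token_end": 0, "label": "ORG"},
        {"start": 15, "end": 29, "token_start": 4, "token_end": 5, "label": "LOCATION"},
    ],
}


def span_text_from_tokens(example, span):
    """Rebuild a span's text from its token indices (token_end is inclusive)."""
    tokens = example["tokens"][span["token_start"]: span["token_end"] + 1]
    # Slice the original text from the first token's start to the last token's end.
    return example["text"][tokens[0]["start"]: tokens[-1]["end"]]


for span in example["spans"]:
    by_chars = example["text"][span["start"]: span["end"]]
    by_tokens = span_text_from_tokens(example, span)
    # Both routes should give the same text when spans and tokens agree.
    print(span["label"], repr(by_chars), by_chars == by_tokens)
```

If the two routes ever disagree for one of your own records, that is a sign the spans and the tokens came from different tokenizers.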

Now let's assume instead that you're not annotating in Prodigy but bringing in annotations from somewhere else. The simplest way would be to run add_tokens on your .json data before loading it with db-in, like:

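from prodigy.components.preprocess import add_tokens
import spacy

# Load the pipeline whose tokenizer you plan to train with
nlp = spacy.load("en_core_web_sm")
# Placeholder examples; replace with your own records (including their "spans")
stream = [{"text": "Hello world"}, {"text": "Another text"}]
# skip=True skips examples whose spans don't match the tokenization instead of raising an error
stream = add_tokens(nlp, stream, skip=True)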

You can then use this script to export your stream to a .jsonl file (e.g., with srsly.write_jsonl("/path/to/file.jsonl", stream)).

Now, what's very important (and sometimes overlooked): what tokenizer did you use to get your spans? If you used something other than spaCy, you will likely get mismatched tokenization, which can become a headache.

Here's a post that describes the problem and has a way to detect mismatched examples:

You may find that only a small number of mismatches occur, but even a single one will likely trigger an error message. There are many other related issues (40+ that mention "token mismatch"), as well as an open source library for tokenization mismatch and alignment that can help.
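As a rough illustration of what such a check might look like, the sketch below flags spans whose character offsets don't fall on token boundaries. It is plain Python over the same JSON shape as the exported annotations; find_misaligned_spans is a hypothetical helper of mine, not a Prodigy function, and the second example is made up to force a mismatch.

```python
def find_misaligned_spans(example):
    """Return the spans whose start/end offsets are not token boundaries."""
    starts = {tok["start"] for tok in example["tokens"]}
    ends = {tok["end"] for tok in example["tokens"]}
    return [
        span
        for span in example.get("spans", [])
        if span["start"] not in starts or span["end"] not in ends
    ]


text = "Start-Up Machine"
tokens = [
    {"text": "Start", "start": 0, "end": 5, "id": 0},
    {"text": "-", "start": 5, "end": 6, "id": 1},
    {"text": "Up", "start": 6, "end": 8, "id": 2},
    {"text": "Machine", "start": 9, "end": 16, "id": 3},
]

# Span covering "Start-Up": starts and ends on token boundaries.
aligned = {"text": text, "tokens": tokens,
           "spans": [{"start": 0, "end": 8, "label": "PRODUCT"}]}
# Made-up span that ends in the middle of the token "Machine".
misaligned = {"text": text, "tokens": tokens,
              "spans": [{"start": 9, "end": 13, "label": "PRODUCT"}]}

for eg in (aligned, misaligned):
    bad = find_misaligned_spans(eg)
    print(len(bad), [eg["text"][s["start"]:s["end"]] for s in bad])
```

Running something like this over your file before db-in gives you a list of the records to fix by hand, rather than finding them one error at a time during training.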
