transformer tokenizer

Hi, so the transformer tokenizer recipe expects the following format:

{
  "text": "Justin Bieber - agree 100000%",
  "tokens": [
    {"text": "[CLS]", "id": 0, "start": 0, "end": 0, "tokenizer_id": 101, "disabled": true, "ws": true},
    {"text": "justin", "id": 1, "start": 0, "end": 6, "tokenizer_id": 6796, "disabled": false, "ws": true},
    {"text": "bi", "id": 2, "start": 7, "end": 9, "tokenizer_id": 12170, "disabled": false, "ws": false},
    {"text": "eber", "id": 3, "start": 9, "end": 13, "tokenizer_id": 22669, "disabled": false, "ws": true},
    {"text": "-", "id": 4, "start": 14, "end": 15, "tokenizer_id": 1011, "disabled": false, "ws": true},
    {"text": "agree", "id": 5, "start": 16, "end": 21, "tokenizer_id": 5993, "disabled": false, "ws": true},
    {"text": "1000", "id": 6, "start": 22, "end": 26, "tokenizer_id": 6694, "disabled": false, "ws": false},
    {"text": "00", "id": 7, "start": 26, "end": 28, "tokenizer_id": 8889, "disabled": false, "ws": false},
    {"text": "%", "id": 8, "start": 28, "end": 29, "tokenizer_id": 1003, "disabled": false, "ws": true},
    {"text": "[SEP]", "id": 9, "start": 0, "end": 0, "tokenizer_id": 102, "disabled": true, "ws": true}
  ],
  "spans": [
     {"start": 0, "end": 13, "token_start": 1, "token_end": 3, "label": "PERSON"}
  ]
}

However, the tokenizer I'm using is different: it doesn't provide a tokenizer ID. Are all of the keys within the token dictionaries in the "tokens" list necessary for the UI to function?
Thanks,

Yes, the "tokens" key is definitely required so Prodigy knows where the tokens start and end, and how they map into the text. The existing recipe should take care of the conversion, and you should be able to adapt it from here:
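
In case it helps, here's a minimal sketch of what that conversion can look like, assuming a Hugging Face "fast" tokenizer with offset mappings (the model name and helper function are illustrative, not Prodigy's actual recipe code):

# Sketch: convert Hugging Face tokenizer output to Prodigy's token format
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def make_prodigy_tokens(text):
    encoded = tokenizer(text, return_offsets_mapping=True)
    tokens = []
    for i, (tok_id, (start, end)) in enumerate(
        zip(encoded["input_ids"], encoded["offset_mapping"])
    ):
        piece = tokenizer.convert_ids_to_tokens(tok_id)
        tokens.append({
            # strip the "##" continuation marker from wordpieces
            "text": piece[2:] if piece.startswith("##") else piece,
            "id": i,
            "start": start,
            "end": end,
            "tokenizer_id": tok_id,
            # special tokens like [CLS]/[SEP] map to (0, 0) offsets
            # and shouldn't be selectable in the UI
            "disabled": start == end,
            # "ws": whether the token is followed by whitespace,
            # so the UI can render the text faithfully
            "ws": start == end or end >= len(text) or text[end].isspace(),
        })
    return {"text": text, "tokens": tokens}

print(make_prodigy_tokens("Justin Bieber - agree 100000%"))

The "ws" flag tells the UI whether to render a space after the token, and disabling the special tokens keeps them from being selected as part of a span.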

Alternatively, if you're using spaCy v3, the tokenization alignment happens automatically, so you can just use the built-in recipes like ner.manual or ner.correct and then train a transformer-based pipeline with spaCy, using embeddings of your choice.
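
For example, here's a quick sketch of what that looks like on the spaCy side (assuming a transformer-based pipeline like en_core_web_trf is installed; any transformer pipeline works):

import spacy

# en_core_web_trf is just an example of a transformer-based pipeline
nlp = spacy.load("en_core_web_trf")
doc = nlp("Justin Bieber - agree 100000%")

# you annotate against spaCy's linguistic tokens; spacy-transformers aligns
# them to the model's wordpieces internally, so no manual mapping is needed
print([(token.text, token.idx) for token in doc])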