Have you looked at the thread Ines linked earlier for dealing with non-matching tokens, "Matching tokenisation on pre-existing annotated data"?
The other option is indeed to define the tokens in your input data - then the tokenizer won't run, and Prodigy will just take the tokens as you've defined them. You can see the expected format here: https://prodi.gy/docs/api-interfaces#ner_manual
```json
{
  "text": "First look at the new MacBook Pro",
  "spans": [
    {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
  ],
  "tokens": [
    {"text": "First", "start": 0, "end": 5, "id": 0},
    {"text": "look", "start": 6, "end": 10, "id": 1},
    {"text": "at", "start": 11, "end": 13, "id": 2},
    {"text": "the", "start": 14, "end": 17, "id": 3},
    {"text": "new", "start": 18, "end": 21, "id": 4},
    {"text": "MacBook", "start": 22, "end": 29, "id": 5},
    {"text": "Pro", "start": 30, "end": 33, "id": 6}
  ]
}
```
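In case it's useful, here's a minimal sketch of how you could generate that format if you already have your own tokenization - this is plain Python, not a built-in Prodigy helper, and `make_example` is just a name I made up:

```python
def make_example(text, token_texts):
    """Build a Prodigy-style task dict from a text and its token strings.

    Assumes the tokens occur in the text in order, so the character
    offsets can be recovered with str.index().
    """
    tokens = []
    offset = 0
    for i, tok in enumerate(token_texts):
        start = text.index(tok, offset)  # char offset of this token
        end = start + len(tok)
        tokens.append({"text": tok, "start": start, "end": end, "id": i})
        offset = end
    return {"text": text, "tokens": tokens}

example = make_example(
    "First look at the new MacBook Pro",
    ["First", "look", "at", "the", "new", "MacBook", "Pro"],
)
```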
(You don't need to have the spans predefined, by the way - if you leave them out, you'll just annotate everything yourself.)
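And if you do have existing span annotations with character offsets, you could fill in `"token_start"` and `"token_end"` by matching those offsets against the tokens. Again just a sketch (`add_token_indices` is a made-up name) - note that it raises `StopIteration` if a span doesn't line up exactly with token boundaries, which is the mismatch problem discussed in the thread above:

```python
def add_token_indices(span, tokens):
    # Find the tokens whose character offsets line up with the span boundaries.
    span["token_start"] = next(t["id"] for t in tokens if t["start"] == span["start"])
    span["token_end"] = next(t["id"] for t in tokens if t["end"] == span["end"])
    return span

span = {"start": 22, "end": 33, "label": "PRODUCT"}
add_token_indices(span, example["tokens"])  # `example` from the sketch above
# -> {"start": 22, "end": 33, "label": "PRODUCT", "token_start": 5, "token_end": 6}
```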