ner.manual does not use custom tokens

Hello! Let's consider the following toy problem:

I have a jsonl data file with 1 line:

{ "text": "a:b c", "tokens": [{"text": "a:b", "start": 0, "end": 3, "id": 0}, {"text": "c", "start": 4, "end": 5, "id": 1}] }
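As a quick sanity check, the offsets in the line above can be verified in plain Python. This is only a sketch and not part of Prodigy; `check_tokens` is a hypothetical helper name, and it assumes each token's "start"/"end" are character offsets into "text" and that "id" counts up from 0:

```python
import json


def check_tokens(task):
    """Return a list of problems found in a task's "tokens" (empty if none)."""
    problems = []
    text = task["text"]
    for i, tok in enumerate(task.get("tokens", [])):
        # Assumption: ids are consecutive and start at 0.
        if tok["id"] != i:
            problems.append(f"token {i}: id is {tok['id']}, expected {i}")
        # The character span must reproduce the token text exactly.
        span = text[tok["start"]:tok["end"]]
        if span != tok["text"]:
            problems.append(f"token {i}: span {span!r} != text {tok['text']!r}")
    return problems


line = (
    '{ "text": "a:b c", "tokens": ['
    '{"text": "a:b", "start": 0, "end": 3, "id": 0}, '
    '{"text": "c", "start": 4, "end": 5, "id": 1}] }'
)
print(check_tokens(json.loads(line)))  # prints []
```

An empty list means the tokens line up with the text, so any remaining problem is not in the data itself.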

and I'm trying to use ner.manual to label it:

prodigy ner.manual testing_2 blank:en blah.jsonl --label FOO,BAR

I was hoping it would use my custom tokenization of ["a:b", "c"]. Instead, it still applies the standard spaCy tokenization: ["a", ":", "b", "c"].

Is this expected behaviour? Is there a way to force Prodigy to use my custom tokenization?


Are you using the latest version? Prodigy's built-in token splitting should accept pre-defined tokenization (there used to be a problem, but it should be fixed now).

In the meantime, you could also just use the mark recipe if you already have pre-tokenized data. The recipe will stream in whatever is in the data and render it with a given interface. So your command could look like this:

prodigy mark your_dataset ./your_data.jsonl --view-id ner_manual --label FOO,BAR
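If your data isn't pre-tokenized yet, a file in the same shape as the example at the top of the thread could be generated with something like the following. This is only a sketch: it assumes whitespace-separated tokens are what you want, and `whitespace_tokens` and `make_task` are made-up helper names, not Prodigy functions.

```python
import json
import re


def whitespace_tokens(text):
    """Split on whitespace and record character offsets for each token."""
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]


def make_task(text):
    """Build one task dict with "text" and pre-defined "tokens"."""
    return {"text": text, "tokens": whitespace_tokens(text)}


# One JSON object per line is the JSONL layout used in this thread.
print(json.dumps(make_task("a:b c")))
```

Writing one such line per example to a .jsonl file gives you input you can pass to the command above.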

I'm on v1.9.5 - would you be able to try out my toy example above and see if it works for you?

I just tried using mark, but it seems to just get stuck at "Loading..."

I just tried it and it seems to work fine for me – here's what I see:

You can set PRODIGY_LOGGING=basic to see what it does behind the scenes. If you see neither an error nor "No tasks available", and it just gets stuck at loading, double-check that you're passing in the input file correctly. If no input file is provided, Prodigy will try to read from standard input.
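To rule out a problem with the input file itself before starting the server, a small check along these lines may help. It's a sketch in plain Python, independent of Prodigy; `check_jsonl` is a hypothetical name, and it assumes every task needs a "text" key, as in the example in this thread.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return a list of problems with a JSONL input file (empty if it looks fine)."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]
    problems = []
    n_tasks = 0
    with p.open(encoding="utf8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                task = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append(f"line {line_no}: invalid JSON ({err.msg})")
                continue
            # Assumption: each task is an object with a "text" key.
            if not isinstance(task, dict) or "text" not in task:
                problems.append(f'line {line_no}: missing "text" key')
            else:
                n_tasks += 1
    if n_tasks == 0 and not problems:
        problems.append(f"{path}: file contains no tasks")
    return problems


print(check_jsonl("blah.jsonl"))
```

If this reports a missing or empty file, the path passed on the command line is the first thing to look at.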