ner-manual does not use custom tokens

edu · January 28, 2020, 2:48pm

Hello! Let's consider the following toy problem:

I have a jsonl data file with 1 line:

{ "text": "a:b c", "tokens": [{"text": "a:b", "start": 0, "end": 3, "id": 0}, {"text": "c", "start": 4, "end": 5, "id": 1}] }

and I'm trying to use ner.manual to label it:

prodigy ner.manual testing_2 blank:en blah.jsonl --label FOO,BAR

I was hoping it would use my custom tokenization of ["a:b", "c"]. Instead, it still applies the standard spaCy tokenization ["a", ":", "b", "c"].

Is this expected behaviour? Is there a way to force Prodigy to use my custom tokenization?

Thanks!

ines · January 28, 2020, 5:52pm

Are you using the latest version? Prodigy's built-in token splitting should accept pre-defined tokenization (there used to be a problem, but it should be fixed now).

In the meantime, you could also just use the mark recipe if you already have pre-tokenized data. The recipe will stream in whatever is in the data and render it with a given interface. So your command could look like this:

prodigy mark your_dataset ./your_data.jsonl --view-id ner_manual --label FOO,BAR
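For anyone who needs to produce pre-tokenized data in the first place, here is a minimal sketch of how the "tokens" list from the toy example could be generated. It assumes plain whitespace splitting is the tokenization you want, and the helper names (whitespace_tokens, make_task) are made up for illustration; they are not part of Prodigy.

```python
import json


def whitespace_tokens(text):
    """Split on whitespace and record character offsets for each token."""
    tokens = []
    pos = 0
    for i, word in enumerate(text.split()):
        # Search from the end of the previous token so repeated words
        # get the offsets of their own occurrence.
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        pos = end
    return tokens


def make_task(text):
    """Build one task dict in the same shape as the toy example."""
    return {"text": text, "tokens": whitespace_tokens(text)}


# One JSON object per line is what a .jsonl file expects.
print(json.dumps(make_task("a:b c")))
```

Writing one such line per example to a .jsonl file should give input in the same shape as the data at the top of this thread.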

edu · January 28, 2020, 11:46pm

I'm on v1.9.5 - would you be able to try out my toy example above and see if it works for you?

I just tried using mark, but it seems to just get stuck at "Loading..."

ines · January 29, 2020, 5:30pm

I just tried it and it seems to work fine for me – here's what I see:

You can set PRODIGY_LOGGING=basic to see what it does behind the scenes. If you don't see an error or "No tasks available" and it just gets stuck at loading, double-check that you're passing in the input file correctly. If no input file is provided, Prodigy will try to read from standard input.
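If it helps with the double-checking, below is a rough sketch of a sanity check for the input file that runs independently of Prodigy. It assumes each line should be a JSON object with a "text" string and that every token's "text" should equal text[start:end], as in the toy example. The function names (check_task, check_jsonl) are hypothetical.

```python
import json


def check_task(task):
    """Return a list of problems found in a single task dict."""
    text = task.get("text")
    if not isinstance(text, str):
        return ["missing 'text' string"]
    problems = []
    for tok in task.get("tokens", []):
        # The token text should match the slice given by its offsets.
        span = text[tok["start"]:tok["end"]]
        if span != tok["text"]:
            problems.append(
                f"token {tok.get('id')}: expected {tok['text']!r}, "
                f"offsets give {span!r}"
            )
    return problems


def check_jsonl(lines):
    """Check an iterable of JSONL lines; return (line_number, problem) pairs."""
    report = []
    for n, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            task = json.loads(line)
        except json.JSONDecodeError as err:
            report.append((n, f"invalid JSON: {err.msg}"))
            continue
        report.extend((n, p) for p in check_task(task))
    return report
```

You could call it as check_jsonl(open("blah.jsonl", encoding="utf8")). An empty list means the file parsed and the offsets line up, which at least rules out the data itself as the reason for the "Loading..." state.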
