custom tokenizer review recipe problem

kak-to-tak · February 25, 2020, 10:39am

Hi,
There is a problem with using a custom tokenizer in the review recipe. The main text is tokenized with the custom tokenizer (BPE), but the text below it (the annotations from different sessions) is tokenized with the default one. Because of this mismatch, the wrong words are highlighted, which makes it impossible to check the results.
Is it possible to solve this?

ines · February 25, 2020, 11:06am

Hi! Could you share the JSON of that review example (e.g. via saving it to the dataset and running db-out)?

And just to make sure I understand the problem correctly: How were the annotations created and when was the text tokenized with a custom tokenizer vs. spaCy's default tokenization?

kak-to-tak · February 25, 2020, 11:50am

The BPE tokenizer was used during annotation with the ner.custom recipe, and the problem appears when calling the review recipe.
You can run the review recipe on the following file to see it.
prices_ner_review.jsonl (25.5 KB)
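
For readers following along, here is a minimal sketch of why the highlights shift. It is not Prodigy's actual code: `tokenize_whitespace` and `tokenize_subword` are hypothetical stand-ins for the default and BPE tokenizers, and the example text is made up. It only assumes the usual task layout where each span stores character offsets (`start`, `end`) as well as token indices (`token_start`, `token_end`, inclusive).

```python
def tokenize_whitespace(text):
    """Hypothetical stand-in for a default word-level tokenizer."""
    tokens, pos = [], 0
    for i, word in enumerate(text.split()):
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append({"id": i, "text": word, "start": start, "end": end})
        pos = end
    return tokens


def tokenize_subword(text, max_len=3):
    """Hypothetical stand-in for BPE: cut each word into pieces of max_len chars."""
    tokens = []
    for word in tokenize_whitespace(text):
        for offset in range(0, len(word["text"]), max_len):
            piece = word["text"][offset:offset + max_len]
            start = word["start"] + offset
            tokens.append({"id": len(tokens), "text": piece,
                           "start": start, "end": start + len(piece)})
    return tokens


def highlight(tokens, span):
    """Return the token texts a UI would highlight using the span's token indices."""
    return [t["text"] for t in tokens[span["token_start"]:span["token_end"] + 1]]


def realign(span, tokens):
    """Recompute token indices from character offsets.

    Returns None if the span does not start and end on token boundaries.
    """
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    if span["start"] not in starts or span["end"] not in ends:
        return None
    return {**span,
            "token_start": starts[span["start"]],
            "token_end": ends[span["end"]]}


text = "price 1500 rub"
subword_tokens = tokenize_subword(text)   # pri | ce | 150 | 0 | rub
word_tokens = tokenize_whitespace(text)   # price | 1500 | rub

# Span created during annotation, with indices into the subword tokens.
span = {"start": 6, "end": 10, "token_start": 2, "token_end": 3, "label": "PRICE"}

print(highlight(subword_tokens, span))  # the pieces that make up "1500"
print(highlight(word_tokens, span))     # same indices, other tokenization: wrong word
print(highlight(word_tokens, realign(span, word_tokens)))  # realigned by characters
```

The character offsets stay valid whichever tokenizer is used, so recomputing the token indices from them is one possible workaround when two tokenizations have to be displayed together. Whether the review recipe can be configured to do this is a separate question.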
