Hi,
there is a problem with using a custom tokenizer in the review recipe. The main text is tokenized with the custom tokenizer (BPE), but the text below it (the annotations from different sessions) is tokenized with the default one. Because of this mismatch, the wrong words are highlighted, which makes it impossible to check the results.
Is it possible to solve this?
Hi! Could you share the JSON of that review example (e.g. via saving it to the dataset and running db-out)?
And just to make sure I understand the problem correctly: How were the annotations created and when was the text tokenized with a custom tokenizer vs. spaCy's default tokenization?
The BPE tokenizer was used during the annotation process with the ner.custom recipe, and the problem appears when calling the review recipe.
You can run the review recipe on the following file to reproduce it.
prices_ner_review.jsonl (25.5 KB)
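
To make the mismatch described in this thread concrete, here is a minimal sketch in plain Python. It does not use Prodigy or the actual BPE tokenizer: the example text, the subword split, the CURRENCY label and the helper names (whitespace_tokens, realign_span) are all hypothetical, and the token and span dictionaries only assume the usual layout with character offsets ("start", "end") and token indices ("token_start", "token_end"). It shows how a token index created under one tokenization points at a different word under another, and how the index could be recomputed from the character offsets, which do not depend on the tokenizer.

```python
def whitespace_tokens(text):
    """Stand-in for a default tokenizer: split on single spaces, keep char offsets."""
    tokens, pos = [], 0
    for i, word in enumerate(text.split(" ")):
        tokens.append({"id": i, "text": word, "start": pos, "end": pos + len(word)})
        pos += len(word) + 1  # skip the space that followed the word
    return tokens


def realign_span(span, tokens):
    """Recompute token_start/token_end from the span's character offsets."""
    covered = [t["id"] for t in tokens
               if t["start"] < span["end"] and t["end"] > span["start"]]
    if not covered:
        return None  # the span does not overlap any token
    return {**span, "token_start": covered[0], "token_end": covered[-1]}


text = "price is 1200usd today"

# Hypothetical subword tokenization used at annotation time: "1200usd" is split.
custom_tokens = [
    {"id": 0, "text": "price", "start": 0, "end": 5},
    {"id": 1, "text": "is", "start": 6, "end": 8},
    {"id": 2, "text": "1200", "start": 9, "end": 13},
    {"id": 3, "text": "usd", "start": 13, "end": 16},
    {"id": 4, "text": "today", "start": 17, "end": 22},
]

# Span annotated on "usd" under the custom tokenization.
span = {"start": 13, "end": 16, "label": "CURRENCY", "token_start": 3, "token_end": 3}

# Re-tokenizing with the default tokenizer shifts the token indices.
default_tokens = whitespace_tokens(text)
wrong_word = default_tokens[span["token_start"]]["text"]   # "today"

# Realigning via character offsets finds the token that contains "usd".
fixed = realign_span(span, default_tokens)
right_word = default_tokens[fixed["token_start"]]["text"]  # "1200usd"

print(wrong_word, right_word)
```

Note that after realignment the highlighted token is "1200usd" rather than "usd", because the default tokenization in this sketch cannot represent the subword boundary. Displaying the span exactly as annotated would require reusing the tokens that were stored with the example.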