Network applications may have bugs

luoshengmen98 · August 21, 2023, 4:38pm

During a normal NER run, I loaded a JSONL file into the NER recipe with a BERT model, and an error suddenly appeared.

ryanwesslen · August 21, 2023, 6:10pm

Per your earlier message, I think misaligned tokenization is the problem, not a bug.

In fact, we've seen that same error message (reading 'start') come from a similar problem.

I think this points to a mismatch between character-based tokenization (as in Chinese) and token-based tokenization. You may want to use character-based highlighting for your Prodigy annotations. The docs describe this:

The ner.manual recipe also lets you set a --highlight-chars flag to allow highlighting individual characters instead of only tokens. This will only store the character offsets of your annotation and won’t add a "tokens" property to the saved task.

When using character-based highlighting, annotation may be slower and there’s no guarantee that the spans you annotate map to actual tokens later on. If your goal is to train a named entity recognizer, you should consider using the same tokenizer during annotation, to make sure that your data can be used. Also see the section on efficient annotation for transformers if you’re training a transformer-based model (e.g. BERT) with subword tokenization.
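
To make the mismatch concrete, here is a minimal Python sketch of how a character-offset span can fail to line up with token boundaries. It is not Prodigy's own code: `whitespace_tokens`, `char_tokens` and `align_span` are hypothetical helpers, and the dict layout (`start`, `end`, `id`, `token_start`, `token_end`) is only assumed to resemble the "tokens" and "spans" fields of a task.

```python
from typing import Dict, List, Optional


def whitespace_tokens(text: str) -> List[Dict]:
    """Hypothetical stand-in tokenizer: split on whitespace, keep char offsets."""
    tokens = []
    pos = 0
    for i, word in enumerate(text.split()):
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        pos = end
    return tokens


def char_tokens(text: str) -> List[Dict]:
    """Hypothetical character-level tokenizer: every character is one token."""
    return [
        {"text": ch, "start": i, "end": i + 1, "id": i}
        for i, ch in enumerate(text)
    ]


def align_span(span: Dict, tokens: List[Dict]) -> Optional[Dict]:
    """Add token_start/token_end to a span, or return None when its
    character offsets do not fall on token boundaries."""
    starts = {t["start"]: t["id"] for t in tokens}
    ends = {t["end"]: t["id"] for t in tokens}
    if span["start"] in starts and span["end"] in ends:
        return {
            **span,
            "token_start": starts[span["start"]],
            "token_end": ends[span["end"]],
        }
    return None


text = "我住在北京"
span = {"start": 3, "end": 5, "label": "GPE"}  # covers "北京"

# Whitespace splitting sees the whole sentence as one token, so the span
# starts in the middle of a token and cannot be aligned.
print(align_span(span, whitespace_tokens(text)))

# With character-level tokens the same offsets land on token boundaries.
print(align_span(span, char_tokens(text)))
```

The first call returns `None` because the span begins inside a token, which is the kind of misalignment that can surface as an error in the annotation UI. The second call succeeds because every character is its own token. The same check can be run over a JSONL file before loading it, to find the examples whose spans do not match the tokenizer you plan to use.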

