Network applications may have bugs

ryanwesslen · August 21, 2023, 6:10pm

Per your earlier message, I think misaligned tokenization is the problem, not a bug.

In fact, we've had that same error message (reading 'start') from a similar problem:

I think this points to a mismatch between character-based tokenization (as in Chinese) and token-based tokenization. You may want to enable character-based highlighting in your Prodigy annotations. The docs describe this:

The ner.manual recipe also lets you set a --highlight-chars flag to allow highlighting individual characters instead of only tokens. This will only store the character offsets of your annotation and won’t add a "tokens" property to the saved task.

When using character-based highlighting, annotation may be slower and there’s no guarantee that the spans you annotate map to actual tokens later on. If your goal is to train a named entity recognizer, you should consider using the same tokenizer during annotation, to make sure that your data can be used. Also see the section on efficient annotation for transformers if you’re training a transformer-based model (e.g. BERT) with subword tokenization.
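To check whether existing annotations have this kind of mismatch, something like the sketch below might help. It assumes each task is a dict with "text", "tokens" and "spans" keys carrying character offsets. The `tokenize` and `misaligned_spans` helpers are hypothetical names I made up for illustration, and the whitespace tokenizer is only a stand-in for whatever tokenizer your model actually uses.

```python
import re


def tokenize(text):
    """Stand-in whitespace tokenizer (an assumption, not your real tokenizer).

    Returns one dict per token with its character offsets.
    """
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]


def misaligned_spans(task):
    """Return the spans whose offsets do not fall on token boundaries."""
    token_starts = {t["start"] for t in task["tokens"]}
    token_ends = {t["end"] for t in task["tokens"]}
    return [
        span
        for span in task["spans"]
        if span["start"] not in token_starts or span["end"] not in token_ends
    ]


text = "Visit NewYork today"
task = {
    "text": text,
    "tokens": tokenize(text),
    "spans": [
        {"start": 6, "end": 13, "label": "GPE"},  # "NewYork": a whole token
        {"start": 9, "end": 13, "label": "GPE"},  # "York": starts mid-token
    ],
}

bad = misaligned_spans(task)
for span in bad:
    print("Misaligned:", repr(text[span["start"]:span["end"]]), span)
```

Spans flagged this way are the ones that can't be mapped back to tokens, so they are worth fixing or re-annotating before training.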
