TypeError: Cannot read properties of undefined (reading 'start')

Prodigy crashes after I switched to a new data set, any idea?

Update: that error happens even if I input just a single sample, like this one:
app_v5.jsonl (735 Bytes)
I used the following command to start the service:

prodigy ent.rel.tokenizer app_v5 ./app_v5.jsonl --hide-wp-prefix --tokenizer-vocab ./data/vocab.txt -F ./preprocess_v2.py

Hi! Based on the command and screenshot, it looks like you're using a custom recipe, ent.rel.tokenizer, so the problem might be related to whatever happens in the recipe. Could you share the code or some more details on what your custom recipe does?

Sure, here is the code for ent.rel.tokenizer. I copied it from the NER sample code and made a few small modifications for the relation task.

from typing import List, Optional, Union, Iterable, Dict, Any
from tokenizers import BertWordPieceTokenizer
from prodigy.components.loaders import JSONL, get_stream
from prodigy.util import get_labels
import prodigy

@prodigy.recipe(
    "ent.rel.tokenizer",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    loader=("Loader (guessed from file extension if not set)", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    tokenizer_vocab=("Tokenizer vocab file", "option", "tv", str),
    lowercase=("Set lowercase=True for tokenizer", "flag", "LC", bool),
    add_special_tokens=("Add SEP and CLS tokens", "flag", "AS", bool),
    hide_special=("Hide SEP and CLS tokens visually", "flag", "HS", bool),
    hide_wp_prefix=("Hide wordpieces prefix like ##", "flag", "HW", bool)
    # fmt: on
)
def manual_tokenizers_bert(
    dataset: str,
    source: Union[str, Iterable[dict]],
    loader: Optional[str] = None,
    label: Optional[List[str]] = None,
    tokenizer_vocab: Optional[str] = None,
    lowercase: bool = False,
    add_special_tokens: bool = False,
    hide_special: bool = False,
    hide_wp_prefix: bool = False,
) -> Dict[str, Any]:
    """Example recipe that shows how to use model-specific tokenizers like the
    BERT word piece tokenizer to preprocess your incoming text for fast and
    efficient NER annotation and to make sure that all annotations you collect
    always map to tokens and can be used to train and fine-tune your model
    (even if the tokenization isn't that intuitive, because word pieces). The
    selection automatically snaps to the token boundaries and you can double-click
    single tokens to select them.

    Setting "honor_token_whitespace": true will ensure that whitespace between
    tokens is only shown if whitespace is present in the original text. This
    keeps the text readable.

    Requires Prodigy v1.10+ and uses the HuggingFace tokenizers library."""
#     stream = get_stream(source, loader=loader, input_key="text")
    stream = JSONL(source)
    # You can replace this with other tokenizers if needed
    tokenizer = BertWordPieceTokenizer(tokenizer_vocab, lowercase=lowercase)
    sep_token = tokenizer._parameters.get("sep_token")
    cls_token = tokenizer._parameters.get("cls_token")
    special_tokens = (sep_token, cls_token)
    wp_prefix = tokenizer._parameters.get("wordpieces_prefix")

    def add_tokens(stream):
        for eg in stream:
            tokens = tokenizer.encode(eg["text"], add_special_tokens=add_special_tokens)
            eg_tokens = []
            idx = 0
            for (text, (start, end), tid) in zip(tokens.tokens, tokens.offsets, tokens.ids):
                # If we don't want to see special tokens, don't add them
                if add_special_tokens and hide_special and text in special_tokens:
                    continue
                # If we want to strip out word piece prefix, remove it from text
                if hide_wp_prefix and wp_prefix is not None:
                    if text.startswith(wp_prefix):
                        text = text[len(wp_prefix) :]
                token = {
                    "text": text,
                    "id": idx,
                    "start": start,
                    "end": end,
                    # This is the encoded ID returned by the tokenizer
                    "tokenizer_id": tid,
                    # Don't allow selecting special SEP/CLS tokens
                    "disabled": text in special_tokens,
                }
                eg_tokens.append(token)
                idx += 1
            for i, token in enumerate(eg_tokens):
                # If the next start offset != the current end offset, we
                # assume there's whitespace in between
                if i < len(eg_tokens) - 1 and token["text"] not in special_tokens:
                    next_token = eg_tokens[i + 1]
                    token["ws"] = (
                        next_token["start"] > token["end"]
                        or next_token["text"] in special_tokens
                    )
                else:
                    token["ws"] = True
            eg["tokens"] = eg_tokens
            yield eg

#     def add_relations_to_stream(stream):
#        # custom_model = load_your_custom_model()
#        for eg in stream:
#           # deps, heads = custom_model(eg["text"])
# #           eg["relations"] = []
# #           for i, (label, head) in enumerate(zip(deps, heads)):
# #              eg["relations"].append({"child": i, "head": head, "label": label})
#           yield eg

    # stream = JSONL(source)       # load the data
    stream = add_tokens(stream)  # add "tokens" to stream
#     stream = add_relations_to_stream(stream)        # add custom relations

    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # the incoming stream of examples
        "view_id": "relations",  # annotation interface to use
        "config": {
            "labels": ["para-conn", "para-excl", "para-limi", "appl-full", "appl-spec"],  # labels to annotate # , "appl-cert", "appl-remain"
            "relations_span_labels": ["legi-sho", "legi-ful", "date-exp", "date-imp", "para", "para-ref"]
        }

    }

Hi, I have a new discovery here: if I keep the relations as an empty list, Prodigy works fine, but then I would have to spend hundreds of hours relabeling the relations again. Hope this helps to narrow down the problem. Thanks, and I look forward to your reply.

Hi @ocelot43, you might want to check the values you assigned to the head and child parameters of your examples. These values take the token index, not the character index. I noticed that you had a very big number, like 123, that exceeds the number of tokens in your text, which would explain the error: the UI looks up the token at that index, finds nothing, and then fails when it tries to read its start offset.
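
To illustrate (the values are made up; the relation format follows the commented-out add_relations_to_stream example in your recipe):

# Suppose the text tokenizes to ["Paragraph", "2", "applies", "in", "full", "."]
# Correct: head/child point at token indices
{"head": 0, "child": 2, "label": "appl-full"}
# Wrong: 123 looks like a character offset and exceeds the number of tokens,
# so there is no token at that index
{"head": 0, "child": 123, "label": "appl-full"}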

Since you're generating your tokens dynamically, you might also want to write a "validator" that confirms if your tokens are correct / reasonable.
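
Something along these lines could work as a quick check, run after add_tokens in the recipe (just a sketch; the function name is made up):

def validate_relations(stream):
    """Warn about relations whose head/child is not a valid token index."""
    for eg in stream:
        n_tokens = len(eg.get("tokens", []))
        for rel in eg.get("relations", []):
            for key in ("head", "child"):
                idx = rel.get(key)
                if not isinstance(idx, int) or not (0 <= idx < n_tokens):
                    print(f"Suspicious {key}={idx!r} with only {n_tokens} tokens: {eg['text'][:40]!r}")
        yield eg

# In the recipe, after tokenization:
# stream = add_tokens(stream)
# stream = validate_relations(stream)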

In any case, do you still have the ./data/vocab.txt file so that I can further replicate your problem?

Hi @ljvmiranda921, sorry for replying late, I was busy with other things and forgot to check on the progress here. As you said, these values take the token index, not the character index. I fixed the bug by replacing the character index with the token index, and everything works fine now. In case you need it, the vocab.txt is the same as the one for "bert-base-cased" in Hugging Face transformers. Thanks again for your advice, and I will try to write a "validator" to check the input data next time.
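
In case it helps anyone else, the conversion boils down to something like this (a rough sketch; it assumes the relations originally stored character offsets in head/child and that it runs after the recipe has added the "tokens" list):

def char_offset_to_token_index(tokens, offset):
    # Find the token whose character span contains the given offset
    for token in tokens:
        if token["start"] <= offset < token["end"]:
            return token["id"]
    return None

def fix_relations(eg):
    # Rewrite head/child from character offsets to token indices, in place
    for rel in eg.get("relations", []):
        for key in ("head", "child"):
            idx = char_offset_to_token_index(eg["tokens"], rel[key])
            if idx is None:
                raise ValueError(f"No token covers {key} offset {rel[key]}")
            rel[key] = idx
    return eg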

Glad it's working 🙂