[E143] Labels for component 'spancat' not initialized.

I've successfully trained 2 models using the method below; the only difference is that this model needed character-based annotation, so I added --highlight-chars:

python -m prodigy spans.manual uncompound blank:en extracted_compound_words.jsonl --label REF,PRICE,BOX_PAPER_TAG,BUY_INTENT,YEAR,MODEL,SIZE,CONDITION,COLOR,MATERIAL,MOVEMENT,SPECIAL_FEATURES,LOCATION,SHIPPING_INFO,LIMITED_EDITION,WARRANTY_STATUS,BRAND --highlight-chars

At this point I have a dataset 'uncompound' that, when I db-out into a JSONL, is formatted correctly for usage. So I do this:

python -m prodigy train ./training/uncompound_blank --spancat uncompound --eval-split 0.25 --label-stats

and I get this error:

File "spacy\pipeline\pipe.pyx", line 121, in spacy.pipeline.pipe.Pipe._require_labels
ValueError: [E143] Labels for component 'spancat' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's initialize method.

Considering I've been doing everything successfully from just the 'prodigy' command lines, I have no idea how to get out of this one. Any help?

Hi @jrouss ,

I'm not 100% sure yet if that's the reason for the initialization error, but --highlight-chars will cause a mismatch in tokenization during training, which in turn leads to examples being dropped.
Which Prodigy version are you using?
Do you see any warnings about examples being skipped in the console?

prodigy 1.14.14

I did the process from the start by doing what my original message says, and I see no messages in the console.

Here's the entire error:

[2024-02-12 18:33:21,798] [INFO] Pipeline: ['tok2vec', 'spancat']
[2024-02-12 18:33:21,804] [INFO] Created vocabulary
[2024-02-12 18:33:21,804] [INFO] Finished initializing nlp object
Traceback (most recent call last):
File "C:\Users\Null\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\Null\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\Null\Desktop\spancat\venv\lib\site-packages\prodigy\__main__.py", line 50, in <module>
main()
File "C:\Users\Null\Desktop\spancat\venv\lib\site-packages\prodigy\__main__.py", line 44, in main
controller = run_recipe(run_args)
File "cython_src\prodigy\cli.pyx", line 123, in prodigy.cli.run_recipe
File "cython_src\prodigy\cli.pyx", line 124, in prodigy.cli.run_recipe
File "C:\Users\Null\Desktop\spancat\venv\lib\site-packages\prodigy\recipes\train.py", line 308, in train
return _train(
File "C:\Users\Null\Desktop\spancat\venv\lib\site-packages\prodigy\recipes\train.py", line 221, in _train
nlp = spacy_init_nlp(config, use_gpu=gpu_id)
File "C:\Users\Null\Desktop\spancat\venv\lib\site-packages\spacy\training\initialize.py", line 95, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "C:\Users\Null\Desktop\spancat\venv\lib\site-packages\spacy\language.py", line 1349, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "C:\Users\Null\Desktop\spancat\venv\lib\site-packages\spacy\pipeline\spancat.py", line 675, in initialize
self._require_labels()
File "spacy\pipeline\pipe.pyx", line 121, in spacy.pipeline.pipe.Pipe._require_labels
ValueError: [E143] Labels for component 'spancat' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's initialize method.

Hi @jrouss,

Effectively, all your examples are being dropped due to the tokenization mismatch. When initializing the spancat component, spaCy collects the labels from all valid spans. If no span is valid, which will be the case if all spans are character subsets of tokens, there are no labels to initialize the component with, and that is the reason for the error you're seeing.
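To see why those spans can't be aligned, here's a minimal sketch (not Prodigy's internal code) using spaCy's Doc.char_span, which returns None when a character range doesn't line up with token boundaries:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("ltbbanana10usd")  # the blank tokenizer keeps this as a single token
print(len(doc))              # 1

# Try to turn the character range for "banana" (chars 3-9) into a token-based span
span = doc.char_span(3, 9, label="MODEL")
print(span)  # None: the range falls inside a token, so no span can be created
```

Every annotation like this gets dropped during training, so spancat never sees any labels.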

As we advise in the docs on --highlight-chars, the same tokenizer should be used during annotation and training:

When using character-based highlighting, annotation may be slower and there's no guarantee that the spans you annotate map to actual tokens later on. If your goal is to train a named entity recognizer, you should consider using the same tokenizer during annotation, to make sure that your data can be used. Also see the section on efficient annotation for transformers if you're training a transformer-based model (e.g. BERT) with subword tokenization.

It's true that this warning is in the NER section of the documentation; it should be added to the span categorization section as well.

Before recommending the next steps, I'd like to understand more about your use case. I recall you were working with agglutinations of the kind:

Sometimes, messages turn into things like this: ltbbanana10usd

where 'ltb' = buy_intent, 'banana' = product, '10usd' = price

I'd like to reiterate here that for these kinds of problems it's really recommended to split the analysis into steps, to avoid mixing "regular" spans with spans that are substrings.
In other words, you'd detect the agglutinations in step 1 (as discussed in the quoted post), and in step 2 you'd deal with splitting them into tokens. I imagine you're now tackling step 2.
If that's the case, I wouldn't go straight to training the model, as we need to sort out the splitting first. Character-based annotations are supposed to help you build the custom tokenizer that deals with these. Now that you are more familiar with the problem through the experience of annotating, it would be best to analyze whether it's feasible to split these using rules, by considering the following questions:
How many examples of these do you have in total?
How much variation is there?
Are any substrings a finite set of tokens that you can use patterns or dictionary lookup for?
Do you expect new combinations to show up in production?
Are any substrings capturable by regex? (looks like 10usd is for example)
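To make the rule-based idea concrete, here's a rough sketch. The patterns are hypothetical ('ltb'/'wtb' as buy-intent terms, a currency-amount regex); your real rules would come from your own label inventory:

```python
import re

# Hypothetical patterns, for illustration only
PRICE_RE = re.compile(r"\d+(usd|eur|gbp)$", re.IGNORECASE)
BUY_INTENT_TERMS = {"ltb", "wtb"}  # dictionary lookup for a finite set of terms

def split_agglutination(word):
    """Very rough sketch: peel off a known prefix and a regex-matched suffix."""
    parts = []
    for term in BUY_INTENT_TERMS:
        if word.startswith(term):
            parts.append(term)
            word = word[len(term):]
            break
    tail = None
    m = PRICE_RE.search(word)
    if m:
        tail = m.group(0)
        word = word[:m.start()]
    if word:
        parts.append(word)
    if tail:
        parts.append(tail)
    return parts

print(split_agglutination("ltbbanana10usd"))  # ['ltb', 'banana', '10usd']
```

With 18 labels and combinations of up to 5 of them, a single function like this won't cover everything, but it shows the shape of a first pass you can measure against your annotations.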

You should be using your current annotations as a test set for the custom rules you'll develop.
It would probably be most convenient to add these rules to a custom spaCy tokenizer.
Once you have your custom tokenizer in place, you should be able to reapply the current span annotations to the re-tokenized dataset with a Python script. We can help with this when you get there.
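As a sketch of what that reapplication script could look like, assuming the usual Prodigy span format with character "start"/"end" offsets and an nlp object whose tokenizer already splits the compounds (a plain blank:en pipeline stands in for it here):

```python
import spacy

nlp = spacy.blank("en")  # stand-in: plug in your custom tokenizer here

def reapply_spans(example):
    """Map character-offset span annotations onto the (re-)tokenized text."""
    doc = nlp(example["text"])
    spans = []
    for s in example.get("spans", []):
        # alignment_mode="expand" snaps offsets outward to token boundaries
        span = doc.char_span(s["start"], s["end"], label=s["label"],
                             alignment_mode="expand")
        if span is not None:
            spans.append(span)
    return doc, spans

example = {"text": "ltb banana 10usd",
           "spans": [{"start": 4, "end": 10, "label": "MODEL"}]}
doc, spans = reapply_spans(example)
print([(sp.text, sp.label_) for sp in spans])  # [('banana', 'MODEL')]
```

Once the tokenizer splits the compounds, the same character offsets that were useless before now resolve to real token-based spans.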

Another advantage of this 2-step approach is that you could have some fallback mechanism for the agglutinations that don't get tokenized correctly by your rules.
In general, large language models should be really good at dealing with these kinds of problems (both detecting the agglutinations and splitting them into subwords).
I tried your example ltbbanana10usd in ChatGPT and it did very well on it (screenshot omitted).

You could consider adding a spacy-llm component to your pipeline (see "Large Language Models" in the spaCy usage documentation).
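For reference, a spacy-llm component is declared in the training config along these lines. This is a hypothetical fragment; the exact task and model registry names (and their versions) should be checked against the current spacy-llm docs:

```ini
[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.SpanCat.v3"
labels = ["BUY_INTENT", "MODEL", "PRICE"]

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v2"
```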
There are also some Python libraries out there that might help; here's an example: GitHub - droid-surbhi/split-compound-words
Finally, if your solution for splitting these words overgenerates a bit, that's still better than preserving some of these agglutinations: for training purposes, spans can be made up of multiple tokens, so over-splitting wouldn't be a technical issue.

Thank you for your detailed analysis and response:

I'll admit I'm pretty new/naive to everything, but I had great success training 2 models to do some labeling, so I just assumed the "magic" would work with character annotation too.

How many examples of these do you have in total? Low hundreds, but I can get more (slowly) in real time from my leads.
How much variation is there? Usually minor, with just 2 'labels' being stuck together; however, it's user free-text from the internet, so sometimes I've seen 4 or 5 'labels' stuck together.
Are any substrings a finite set of tokens that you can use patterns or dictionary lookup for? Do you mean something like color? If so, some, but others are sort of slang terms I'd need to manually build a dataset for.
Do you expect new combinations to show up in production? Yes, it's all free-text from users of various languages.
Are any substrings capturable by regex? (looks like 10usd is, for example) Simple ones, but I have around 18 possible labels, where combinations of 1-5 of them can be stuck together into one compound word.

I'll probably start with the custom tokenizer. I'm going to read through the links you've sent and see if I can get something working at a lower level / via a different route. So far my use of Prodigy has been command lines like 'prodigy spans.manual ...', all at a higher level. I'll keep this thread updated if I make progress.

Hi @jrouss ,

From your answers it looks like it might be tricky to capture these by rules. But give it a try; maybe you can get the important majority of them right. If not, I would recommend tasking a language model with the splitting. With spacy-llm it is very convenient to plug in an LLM of choice (including open-source local ones) and define your own prompt.
Looking forward to hearing how it goes!