Hi @jrouss,
Indeed, all your examples are being dropped due to a tokenization mismatch. When initializing the spancat component, spaCy collects the labels from all valid spans. If no span is valid, which will be the case if all spans are character subsets of tokens, there are no labels to initialize the component with, and that is the reason for the error you're seeing.
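To see the mismatch concretely: under spaCy's default tokenizer the whole agglutination is a single token, so character offsets inside it can't be mapped to a token-aligned span. A minimal sketch (the `product` label is just for illustration):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("ltbbanana10usd")
print([t.text for t in doc])  # ['ltbbanana10usd'] - a single token

# "banana" sits at characters 3..9, inside that token, so the offsets
# don't line up with token boundaries and char_span returns None.
# Such spans are dropped when the training corpus is built.
span = doc.char_span(3, 9, label="product")
print(span)  # None
```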
As we advise in the docs on `highlight-chars`, the same tokenizer should be used during annotation and training:
> When using character-based highlighting, annotation may be slower and there's no guarantee that the spans you annotate map to actual tokens later on. If your goal is to train a named entity recognizer, you should consider using the same tokenizer during annotation, to make sure that your data can be used. Also see the section on efficient annotation for transformers if you're training a transformer-based model (e.g. BERT) with subword tokenization.
It's true that this warning is in the NER section of the documentation and it should be added to the SPAN section as well.
Before recommending the next steps, I'd like to understand more about your use case. I recall you were working with agglutinations of the kind:
> Sometimes, messages turn into things like this: `ltbbanana10usd`
> where 'ltb' = buy_intent, 'banana' = product, '10usd' = price
I'd like to reiterate here that for these kinds of problems it's really recommended to split the analysis into steps, to avoid mixing "regular" spans with spans that are substrings of tokens.
In other words, you'd be detecting the agglutinations in step 1 (as discussed in the quoted post) and in step 2 you'd be splitting them into tokens. I imagine you're now tackling step 2.
If that's the case, I wouldn't go straight to training the model, as we need to sort out the splitting first. Character-based annotations are meant to help you build the custom tokenizer that deals with these. Now that you're more familiar with the problem through the experience of annotation, it would be best to analyze whether it's feasible to split these using rules, by considering the following questions:
- How many examples of these do you have in total?
- How much variation is there?
- Are any substrings drawn from a finite set of tokens that you could match with patterns or a dictionary lookup?
- Do you expect new combinations to show up in production?
- Are any substrings capturable by regex? (`10usd` looks like one, for example; see the sketch after this list)
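To make the rule-based option concrete, here's a minimal sketch of such a splitter combining dictionary lookup with a regex for prices. The vocabularies and the `split_agglutination` helper are hypothetical and only illustrate the shape of the approach:

```python
import re

# Hypothetical vocabularies for the dictionary-lookup part.
INTENTS = {"ltb", "wts"}
PRODUCTS = {"banana", "apple"}
PRICE_RE = re.compile(r"\d+(usd|eur)")

def split_agglutination(text: str):
    """Split e.g. 'ltbbanana10usd' into ['ltb', 'banana', '10usd']."""
    parts = []
    rest = text
    for vocab in (INTENTS, PRODUCTS):
        for word in vocab:
            if rest.startswith(word):
                parts.append(word)
                rest = rest[len(word):]
                break
    if PRICE_RE.fullmatch(rest):
        parts.append(rest)
        rest = ""
    # Returning None signals that the rules failed and a fallback is needed.
    return parts if not rest else None

print(split_agglutination("ltbbanana10usd"))  # ['ltb', 'banana', '10usd']
```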
You should be using your current annotation as a test set for the custom rules you'll develop.
It would probably be most convenient to add these rules to a custom spaCy tokenizer.
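For a fixed set of known agglutinations, spaCy's tokenizer special cases can encode the splits directly; for open-ended combinations you'd instead wrap a splitter like the one above in a custom tokenizer. A minimal sketch with a special case (the pieces must concatenate back to exactly the original string):

```python
import spacy

nlp = spacy.blank("en")

# Register a known agglutination as a tokenizer special case.
nlp.tokenizer.add_special_case(
    "ltbbanana10usd",
    [{"ORTH": "ltb"}, {"ORTH": "banana"}, {"ORTH": "10usd"}],
)

print([t.text for t in nlp("ltbbanana10usd")])
# ['ltb', 'banana', '10usd']
```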
Once you have your custom tokenizer in place, you should be able to reapply your current span annotations to the re-tokenized dataset with a Python script. We can help with this when you get there.
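To give you an idea, here's a sketch of such a script, assuming Prodigy-style JSONL with `text` plus character-offset `spans`, and a pipeline that ships your custom tokenizer (the file and model names are placeholders):

```python
import spacy
import srsly

nlp = spacy.load("./model_with_custom_tokenizer")

examples = []
for eg in srsly.read_jsonl("annotations.jsonl"):
    doc = nlp.make_doc(eg["text"])
    spans = []
    for s in eg.get("spans", []):
        # With the new tokenizer the character offsets should now line up
        # with token boundaries; char_span returns None if they still don't.
        span = doc.char_span(s["start"], s["end"], label=s["label"])
        if span is None:
            print(f"Still misaligned: {eg['text'][s['start']:s['end']]!r}")
            continue
        spans.append({
            "start": span.start_char,
            "end": span.end_char,
            "token_start": span.start,
            "token_end": span.end - 1,  # Prodigy's token_end is inclusive
            "label": s["label"],
        })
    eg["spans"] = spans
    examples.append(eg)

srsly.write_jsonl("annotations_retokenized.jsonl", examples)
```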
Another advantage of this 2-step approach is that you could have some fallback mechanism for the agglutinations that don't get tokenized correctly by your rules.
In general, language models should be really good at dealing with these kinds of problems (both detecting the agglutinations and splitting them into subwords).
I tried your example `ltbbanana10usd` in ChatGPT and it did very well on it.
You could consider adding a spaCy `llm` component (Large Language Models · spaCy Usage Documentation) to your pipeline.
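As a rough sketch of what that could look like, following the usage docs (this assumes the `spacy-llm` package is installed and an `OPENAI_API_KEY` is set; the labels are placeholders for your scheme):

```python
import spacy

nlp = spacy.blank("en")
# LLM-backed NER component; task/model versions as in the spacy-llm docs.
nlp.add_pipe("llm", config={
    "task": {
        "@llm_tasks": "spacy.NER.v2",
        "labels": ["BUY_INTENT", "PRODUCT", "PRICE"],
    },
    "model": {"@llm_models": "spacy.GPT-3-5.v1"},
})
nlp.initialize()

doc = nlp("ltb banana 10usd")
print([(ent.text, ent.label_) for ent in doc.ents])
```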
There are also some Python libraries out there that might be of help; here's one example: GitHub - droid-surbhi/split-compound-words
Finally, if your solution for splitting these words overgenerates a bit, that's still better than preserving some of the agglutinations: for training purposes, spans can be made up of multiple tokens, so over-splitting wouldn't be a technical issue.
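For instance, if `10usd` gets over-split into `10` and `usd`, the price annotation simply covers two tokens:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("ltb banana 10 usd")  # over-split: "10usd" became two tokens

# A span can cover multiple tokens, so the annotation still works.
span = doc.char_span(11, 17, label="price")
print(span.text, [t.text for t in span])  # 10 usd ['10', 'usd']
```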