Annotating strings without correct separation

Hey everybody,

I'm annotating examples from bank turnovers in Prodigy, which so far has worked very well.

Sadly, I sometimes come across examples that look like this:

...Kundennummer2785708...
...Hausmacherstr. 34Erstattung Stromkosten...
...Vertragsnummer 12345Zinsen 1.234,56Tilgung 456,78...

The relevant info for my NER here is often the numbers. But I noticed that, when annotating in Prodigy, it is often not possible to label numbers or strings that aren't separated by spaces, like the number in Kundennummer2785708, even though that would be the correct labeling.

I suppose this is because of the tokenization?
Is there any good way to solve this, e.g. switching to span prediction? Are there any drawbacks to that, since NER has worked great so far?

Thanks!

Hi @toadle,

If the NER architecture has worked well so far, I definitely wouldn't try to solve the tokenization issue by switching to the spancat architecture.
The issue is about data preprocessing rather than the modelling technique, and that's where we should address it.
The usual solution here would be to record some of these examples and see if you can fix the tokenization with rules. The best way to implement that is to modify the default spaCy tokenizer by adding your custom rules, so that you can easily integrate it in a spaCy pipeline for annotation, training and production alike.
It does require learning a bit more about customizing spaCy components, but the documentation is excellent.

Prodigy's ner.manual has a character highlighting mode that you can switch on and off from the UI. This lets you highlight subparts of a token, but it doesn't affect the tokenization, so you'd end up with spans that are misaligned with tokens, and these would be rejected as training examples.
The character-level highlighting is meant for models that predict character-based rather than token-based tags, but you could use it to "record" the mistokenized examples and then use that record to write your custom tokenization rules.
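
For reference, the character-level mode is switched on with the --highlight-chars flag of ner.manual. A sketch of such an "audit" session (the dataset, source and label names below are placeholders):

prodigy ner.manual char_audit blank:de corpus/my_examples.jsonl --label CUSTOMER_ID --highlight-chars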

The easiest way to check whether your data contains misaligned span annotations is to convert a Prodigy-annotated example with tokens and spans to a spaCy Doc.
If a span does not align with token boundaries, Doc.char_span returns None, and you can check for that in a simple Python script.
So once you've done your annotation in Prodigy, you could process your data with a script similar to the one below to fish out the misaligned examples and try to fix them with the custom tokenizer:

from spacy.language import Language
from spacy.tokens import Doc


def prodigy2spacy_ner(task: dict, nlp: Language) -> None:
    """Rebuild a Doc from a Prodigy task and report spans that don't align with its tokens."""
    task_hash = task.get("_task_hash")
    tokens = task.get("tokens", [])
    words = [token["text"] for token in tokens]
    spaces = [token["ws"] for token in tokens]
    doc = Doc(nlp.vocab, words=words, spaces=spaces)

    for span in task.get("spans", []):
        # Doc.char_span returns None if the character offsets don't map onto token boundaries
        spacy_span = doc.char_span(span["start"], span["end"], label=span["label"])
        if spacy_span is None:
            print(f"Misaligned span detected in example with task hash {task_hash}")
            print(f"Span: {span}")
            print()
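
To run this check over an annotated dataset, one option is to export it with prodigy db-out and feed each line to the function above (a sketch; the file name is a placeholder):

import spacy
import srsly

nlp = spacy.blank("de")  # only the vocab is used here; the tokens come from the Prodigy task
for task in srsly.read_jsonl("annotations.jsonl"):
    prodigy2spacy_ner(task, nlp)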

Please also see the related posts on dealing with similar "agglutinations" of words.

Thanks for the advice @magdaaniol !

I looked at my example data and managed to configure a tokenizer that works as I'd need it to:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("de")


infixes = nlp.Defaults.infixes + [
    r",",                   # Always split on commas
    r"(?<=\d),(?=\w)",      # Split commas between digits and letters
    r"(?<=\d)(?=[A-Za-z])", # Split between digits and letters
    r"(?<=[a-z])(?=[A-Z])"  # Split where a lowercase precedes an uppercase letter
]

infix_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
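
For example, to eyeball the result on one of the problem strings from my first post:

print([t.text for t in nlp("Hausmacherstr. 34Erstattung Stromkosten")])
# expect "34" and "Erstattung" to come out as separate tokens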

I'm a bit at a loss as to how I'd now use this tokenizer in Prodigy.

Normally I'd just start Prodigy like this to start labeling:

prodigy ner.manual bank_turnovers_properties blank:de corpus/bank_turnovers_properties_20241118_160622.jsonl --label TENANT_NAME,UNIT_NAME,PROPERTY_NAME

No custom config or anything.
How would I use this tokenizer in there now?

I just found this post: Training after annotating with custom tokenizer - #2 by magdaaniol, and am trying to follow it.

So I followed the description in the above post and actually got it to work, like this:

1. Write custom tokenizer
I created a file utils/my_tokenizer.py which looks like this:

from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex, compile_prefix_regex, compile_suffix_regex

def my_tokenizer(nlp):
    infixes = nlp.Defaults.infixes + [
        r",",                     # Always split on commas
        r"(?<=\d)(?=[A-Za-z])",   # Split between digits and letters
        r"(?<=[a-z])(?=[A-Z])",   # Split where lowercase precedes uppercase
        r"(?<=[A-Za-z])(?=\d)",   # Split where letters precede digits
        r"/",                      # Split at every slash
    ]

    infix_re = compile_infix_regex(infixes)  # defaults are already included above
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

2. Prepare custom tokenizer pipeline for Prodigy
In order to use the custom tokenizer within Prodigy, I wrote it into a pipeline using a script named write_tokenizer_pipeline.py:

import spacy

from utils.my_tokenizer import my_tokenizer

# Load a blank German pipeline
nlp = spacy.blank("de")

# Replace the default tokenizer with the customized one
nlp.tokenizer = my_tokenizer(nlp)

nlp.to_disk('./tokenizer/my_tokenizer')

Then I just ran python scripts/python/write_tokenizer_pipeline.py, which writes the custom tokenizer pipeline to the tokenizer/my_tokenizer directory.
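
To make sure the custom rules survive serialization, I can load the saved pipeline and check the tokenization again (a quick illustrative check):

import spacy

nlp = spacy.load("./tokenizer/my_tokenizer")
print([t.text for t in nlp("Kundennummer2785708")])
# with the letter/digit split rule this should come out as ["Kundennummer", "2785708"]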

3. Start prodigy and do labeling
I started Prodigy like this and got to labeling:

prodigy ner.manual dataset_name ./tokenizer/my_tokenizer/ corpus/my_examples.jsonl --label LABEL1,LABEL2,LABEL3

4. Write a spaCy config from the custom tokenizer
To prepare for training, I then generated a spaCy config from it:

prodigy spacy-config custom_tok.cfg --ner dataset_name --base-model ./tokenizer/my_tokenizer/

The above post states that you need to change some things in the config after doing this, but for me (Prodigy v1.17.0, 2024-11-18) no changes were needed.

5. Export datasets for spaCy

Since spaCy needs the binary format for training, I exported the examples like this:

prodigy data-to-spacy custom_tok_output --ner dataset_name --config custom_tok.cfg --base-model ./tokenizer/my_tokenizer

This creates a new directory named custom_tok_output with a spaCy config and the dev and train datasets.

6. Train a model with spaCy

After this I could train a new model like this:

spacy train custom_tok_output/config.cfg --output ./models_custom_tok --paths.train custom_tok_output/train.spacy --paths.dev custom_tok_output/dev.spacy
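
As a quick smoke test, I can then load the trained model and look at both the tokenization and the predictions (illustrative; model path as in my setup):

import spacy

nlp = spacy.load("./models_custom_tok/model-best")
doc = nlp("Vertragsnummer 12345Zinsen 1.234,56Tilgung 456,78")
print([t.text for t in doc])
print([(ent.text, ent.label_) for ent in doc.ents])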

I still have a few questions though, @magdaaniol:

1. Side-effects of more tokens
During labeling I see examples that would need even finer tokens, especially in numbers.
Does producing even more fine-grained tokens have any side effects, especially on numbers? I imagine that when I break a number like "12345" into ["1","2","3","4","5"], this somehow loses info?

2. tok2vec shows loss
In this post you said that the tok2vec layer is not trained. I did not fully comprehend why, but in my training output I actually see a loss in the tok2vec column. Is this expected?

3. Using custom tokenizer in ner.correct

After I've trained and saved the model with the custom tokenizer as above, should I be able to use it in a ner.correct recipe without problems, like this?

prodigy ner.correct dataset_net ./models_custom_tok/model-best/ corpus/my_examples.jsonl --label LABEL1,LABEL2,LABEL3

Hi @toadle ,

Thanks for sharing your steps for annotating and training with the custom tokenizer. It all looks correct to me.
On to your follow-up questions:

1. Side-effects of more tokens

Technically, by splitting numbers into tokens you're not losing information, but you are making the learning task more complex for the model:

  • It becomes harder for the model to directly learn that this can be one five-digit number (especially important if you have numbers that indicate amounts, i.e. where the value matters)
  • The model needs to learn additional patterns about how digits combine into meaningful numbers
  • More tokens mean longer sequences (remember that NER has a limited context window)
  • Each digit position becomes a separate classification task
  • It makes it harder to recognize number-based patterns (e.g. years, amounts, dates)
  • You need to take extra care to make sure that this tokenization is consistent across the dataset

So it depends a bit on the kind of numbers you have in your data. If they are all patterns that need to be decomposed into their constituents, then splitting could work. But if there's a mix of patterns, numbers that denote values, dates etc., I would keep the coarse numeric tokens and perhaps add a rule-based component on top of NER to further parse these numbers into their components.
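
For illustration, a minimal sketch of such a rule-based step (the regex and helper below are hypothetical and assume German-formatted amounts like "1.234,56"):

import re

# Hypothetical post-processing: NER keeps "1.234,56" as one entity,
# and a rule decomposes the matched text into a numeric value afterwards.
AMOUNT_RE = re.compile(r"(?P<int>\d{1,3}(?:\.\d{3})*),(?P<dec>\d{2})")

def parse_amount(entity_text: str):
    match = AMOUNT_RE.search(entity_text)
    if match is None:
        return None
    return float(match.group("int").replace(".", "") + "." + match.group("dec"))

print(parse_amount("1.234,56"))  # 1234.56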

2. tok2vec shows loss

In spaCy, you can configure whether multiple pipeline components share the same embedding (tok2vec) layer or whether each component uses its own independent one. If the NER component has its own tok2vec layer, it won't propagate any loss to the standalone tok2vec component, and you would observe zeroes in the tok2vec loss column.
It looks like your pipeline is configured so that the NER component uses the shared tok2vec layer. Could you share your training config to corroborate that?
In any case, if you haven't seen it yet, I recommend checking the spaCy docs on how the shared embedding layer works, as both setups have their advantages and disadvantages and it's good to understand the trade-off.

3. Using custom tokenizer in ner.correct

Yes, if you are on Prodigy >= 1.15.1 (which I believe you are), that should work, as from this version on we automated sourcing the tokenizer from the spaCy pipeline.

@magdaaniol Thanks for the explanation.

About tok2vec
So I actually did not intentionally set anything in the config.

I only used this command:

prodigy spacy-config custom_tok.cfg --ner dataset_name --base-model ./tokenizer/my_tokenizer/

To produce this custom_tok.cfg:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "de"
pipeline = ["tok2vec","ner"]
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 1000
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = "incorrect_spans"
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0

[corpora.ner]
@readers = "prodigy.NERCorpus.v1"
datasets = ["bank_turnovers_properties"]
eval_datasets = []
default_fill = "outside"
incorrect_key = "incorrect_spans"

[training]
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null

[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "./tokenizer/my_tokenizer/"
vocab = "./tokenizer/my_tokenizer/"

[initialize.components]

[initialize.tokenizer]

It seems that tok2vec was included automatically? As for the "shared embedding" part: I don't fully understand what you mean by that and what the consequences are. I was under the impression that I'd need a tok2vec layer to get embeddings at all?

One additional question
Am I correct in assuming that when I change my custom tokenizer to produce a different tokenization, my previously labeled examples are useless and need to be re-labeled?

I'm asking because so far I've come up with the tokenization rules while looking at examples during labeling, then gone back to change the tokenizer and deleted the dataset to start from scratch.

Hi @toadle ,

About tok2vec

In the config you generated with spacy-config you can see that the NER component uses the shared tok2vec layer:

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"

When the architecture is set to Tok2VecListener it means that it points to the tok2vec layer component defined as the first element of the pipeline:

pipeline = ["tok2vec","ner"]

This is the default behavior when a blank model is used as the base model. Yours is technically a blank model because you only modified the tokenizer; the other components, notably tok2vec and ner, are not trained yet.

This means that during training:

  1. The shared tok2vec layer computes embeddings
  2. The NER component uses these embeddings for its predictions
  3. The loss from NER is backpropagated through both components, training both the NER model and the tok2vec layer, which is why you see a tok2vec loss in your training log

And yes, tok2vec is added automatically and is necessary - it's the component responsible for converting tokens into dense vector representations (embeddings) that the NER model can use for predictions.

In contrast, when using a pre-trained pipeline like en_core_web_sm as the base model (which is the case in the post you cited), the config generated by default will configure the ner component to use an independent tok2vec layer:

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

The main reason for using an independent embedding layer with pre-trained pipelines is to protect the performance of other components:

  1. The pre-trained tok2vec layer is already optimized for components like parser and tagger
  2. Training NER with a shared layer would modify these shared embeddings (as described above)
  3. This could deteriorate the performance of these other components
  4. An independent layer allows NER to learn specialized embeddings without affecting other components

In summary, it's expected and beneficial to see tok2vec being trained in your case, as your base model is a blank spaCy pipeline.
Did you get a chance to look at the spaCy docs on shared vs. independent embedding layers?

Am I correct in assuming that when I change my custom tokenizer to produce a different tokenization, my previously labeled examples are useless and need to be re-labeled?

Yes, in principle, changing the tokenization usually means you need to relabel your data. NER annotations are tied directly to specific token boundaries, so if the tokenization changes, the original entity spans may no longer align with the new tokens.
In practical terms it only matters if many examples in your first dataset become misaligned once the new tokenization is applied. When generating training examples, spaCy checks whether the span offsets are aligned with token boundaries and rejects the misaligned span annotations. So if only a very few examples are affected, it's probably not worth the effort of reannotating; if there are many, then yes.
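
If you want to estimate the impact before deciding, you can retokenize the raw text of your existing annotations with the new pipeline and count how many of the old character offsets no longer map onto token boundaries (a sketch; the export file name is a placeholder):

import spacy
import srsly

nlp = spacy.load("./tokenizer/my_tokenizer")  # the pipeline with the new tokenizer
total = misaligned = 0
for task in srsly.read_jsonl("annotations.jsonl"):
    doc = nlp(task["text"])
    for span in task.get("spans", []):
        total += 1
        if doc.char_span(span["start"], span["end"], label=span["label"]) is None:
            misaligned += 1
print(f"{misaligned}/{total} spans would be misaligned under the new tokenization")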