Include normalization/cleanup in spaCy pipeline or not

Hey everybody,

this isn't really a Prodigy question but rather a question about training data: I recently extended my textcat model to use spacy_transformers and get much better results on unseen examples.

I'm building a model that can classify bank transaction purposes.
To make the transformer work well, I first have to clean up the data so that only the "meaningful" part is handed to the transformer.

e.g. this

ETG SOME R-STR.27 NK SEPA-ÜBERWEISUNG IBAN+ DE12345678901234567890 BIC+ ACMEBICNAME SVWZ+ Utility costs 2024 John Doe KREF+ ACMEBICNAME1234567890

will be turned into

Utility costs 2024 John Doe

I've written a small regex-based Python class for this.
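The class itself isn't important for the question, but for context, here is a minimal sketch of what it does (the field pattern and regex are simplified assumptions, not the exact implementation):

import re

class PurposeCleaner:
    """Sketch of the regex-based cleaner: extracts the free-text
    remittance info from the SVWZ+ field of a raw SEPA purpose string."""

    # SEPA purpose fields look like "IBAN+ ...", "BIC+ ...", "SVWZ+ ...", "KREF+ ..."
    _SVWZ = re.compile(r"SVWZ\+\s*(.*?)(?:\s+[A-Z]{3,4}\+|$)")

    @classmethod
    def clean_purpose(cls, text: str) -> str:
        match = cls._SVWZ.search(text)
        # Fall back to the raw text if no SVWZ+ field is present
        return match.group(1).strip() if match else text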

Now my question is: can I reasonably include the cleanup step in my spaCy pipeline?
The reason is that the real source data looks like the first example; if the cleanup is not part of the pipeline, I have to run it separately on the training data and again before inference.

I tried building a pipeline component like this:

from spacy.language import Language
from spacy.tokens import Doc

from api.utils.purpose_cleaner import PurposeCleaner


@Language.factory("purpose_cleaner")
def create_purpose_cleaner(nlp: Language, name: str):
    def purpose_cleaner_component(doc: Doc) -> Doc:
        text = doc.text
        cleaned = PurposeCleaner.clean_purpose(text)
        # Use nlp.make_doc to re-tokenize the cleaned text
        return nlp.make_doc(cleaned)
    return purpose_cleaner_component

Then integrating it into the pipeline config:

[nlp]
pipeline = ["purpose_cleaner", "transformer", "textcat_multilabel"]

[components]

[components.purpose_cleaner]
factory = "purpose_cleaner"
...
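(Side note in case someone wants to reproduce this: the train CLI only picks up a custom factory if the defining module is importable, e.g. by passing it via the --code flag; the module path below is just illustrative.)

python -m spacy train config.cfg --code api/utils/purpose_cleaner_component.py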

But this results in

⚠ Aborting and saving the final best model. Encountered exception:
ValueError("[E949] Unable to align tokens for the predicted and reference docs.
It is only possible to align the docs when both texts are the same except for
whitespace and capitalization. 

Which makes sense: after cleaning, the predicted doc's text no longer matches the reference doc's text, so the tokens can't be aligned.

What is the best advice for solving this? Running the cleanup on the training data and again before inference is the solution I currently have.

Hi @toadle,

While it's technically possible, spaCy pipeline components should not change the underlying text. Cf. the spaCy docs on tokenization:

spaCy’s tokenization is non-destructive, which means that you’ll always be able to reconstruct the original input from the tokenized output. Whitespace information is preserved in the tokens and no information is added or removed during tokenization. This is kind of a core principle of spaCy’s Doc object: doc.text == input_text should always hold true.

It speaks about the tokenizer, but the same applies to any component that affects tokenization.
I'd say that your current procedure of keeping the input transformation outside the spaCy pipeline is the right way. Otherwise it would be impossible to align the "gold" data with your raw text.
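Concretely, that could look something like this (just a minimal sketch; the model name is a placeholder and PurposeCleaner is your class from above):

import spacy

from api.utils.purpose_cleaner import PurposeCleaner

# Placeholder name for your trained pipeline
nlp = spacy.load("my_textcat_model")

def classify(raw_purpose: str) -> dict:
    # Same cleanup as applied to the training data, run before the pipeline
    cleaned = PurposeCleaner.clean_purpose(raw_purpose)
    doc = nlp(cleaned)
    return doc.cats  # scores per label from textcat_multilabel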

Thx @magdaaniol - always helpful to have your advice on best practices!