Include normalization/cleanup in spaCy pipeline or not

Hey everybody,

this isn't really a Prodigy question but rather a question about training data: I recently extended my textcat model to use spacy_transformers and get much better results on unseen examples.

I'm building a model that can classify bank transaction purposes.
To make the transformer work well, I first have to clean up the data so that only the "meaningful" part is handed to the transformer.

e.g. this

ETG SOME R-STR.27 NK SEPA-ÜBERWEISUNG IBAN+ DE12345678901234567890 BIC+ ACMEBICNAME SVWZ+ Utility costs 2024 John Doe KREF+ ACMEBICNAME1234567890

will be turned into

Utility costs 2024 John Doe

I've written a small regex-based Python class for this.
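The class itself isn't important for the question, but for context, here is a minimal sketch of what it does (the field pattern and regex are simplified assumptions, not the exact implementation):

import re

class PurposeCleaner:
    """Sketch of the regex-based cleaner: extracts the free-text
    remittance info from the SVWZ+ field of a raw SEPA purpose string."""

    # SEPA purpose fields look like "IBAN+ ...", "BIC+ ...", "SVWZ+ ...", "KREF+ ..."
    _SVWZ = re.compile(r"SVWZ\+\s*(.*?)(?:\s+[A-Z]{3,4}\+|$)")

    @classmethod
    def clean_purpose(cls, text: str) -> str:
        match = cls._SVWZ.search(text)
        # Fall back to the raw text if no SVWZ+ field is present
        return match.group(1).strip() if match else text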

Now my question is: can I reasonably include the cleanup step in my spaCy pipeline?
The reason is that the real source data looks like the first example; if the cleanup is not part of the pipeline, I have to run it separately on the training data and again before inference.

I tried building a pipeline component like this:

from spacy.language import Language
from spacy.tokens import Doc

from api.utils.purpose_cleaner import PurposeCleaner


@Language.factory("purpose_cleaner")
def create_purpose_cleaner(nlp: Language, name: str):
    def purpose_cleaner_component(doc: Doc) -> Doc:
        text = doc.text
        cleaned = PurposeCleaner.clean_purpose(text)
        # Use nlp.make_doc to re-tokenize the cleaned text
        return nlp.make_doc(cleaned)
    return purpose_cleaner_component

Then integrating it into the pipeline config:

[nlp]
pipeline = ["purpose_cleaner", "transformer", "textcat_multilabel"]

[components]

[components.purpose_cleaner]
factory = "purpose_cleaner"
...
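(Side note in case someone wants to reproduce this: the train CLI only picks up a custom factory if the defining module is importable, e.g. by passing it via the --code flag; the module path below is just illustrative.)

python -m spacy train config.cfg --code api/utils/purpose_cleaner_component.py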

But this results in

⚠ Aborting and saving the final best model. Encountered exception:
ValueError("[E949] Unable to align tokens for the predicted and reference docs.
It is only possible to align the docs when both texts are the same except for
whitespace and capitalization. 

Which makes sense: after cleaning, the predicted doc's text no longer matches the reference doc's text, so the tokens can't be aligned.

What is the best advice for solving this? Running the cleanup on the training data and again before inference is the solution I currently have.

Hi @toadle,

While it's technically possible, spaCy pipeline components should not change the underlying text. Cf. the spaCy docs on tokenization:

spaCy’s tokenization is non-destructive, which means that you’ll always be able to reconstruct the original input from the tokenized output. Whitespace information is preserved in the tokens and no information is added or removed during tokenization. This is kind of a core principle of spaCy’s Doc object: doc.text == input_text should always hold true.

It speaks about the tokenizer, but the same applies to any component that affects tokenization.
I'd say that your current procedure of keeping the input transformation outside the spaCy pipeline is the right way. Otherwise it would be impossible to align the "gold" data with your raw text.
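Concretely, that could look something like this (just a minimal sketch; the model name is a placeholder and PurposeCleaner is your class from above):

import spacy

from api.utils.purpose_cleaner import PurposeCleaner

# Placeholder name for your trained pipeline
nlp = spacy.load("my_textcat_model")

def classify(raw_purpose: str) -> dict:
    # Same cleanup as applied to the training data, run before the pipeline
    cleaned = PurposeCleaner.clean_purpose(raw_purpose)
    doc = nlp(cleaned)
    return doc.cats  # scores per label from textcat_multilabel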

Thx @magdaaniol - always helpful to have your advice on best practices!