Hey everybody,
this isn't really a Prodigy question but rather a question about training data: I recently extended my textcat model to use spacy_transformers
and now get much better results on unseen examples.
I'm building a model that classifies bank transaction purposes.
To make the transformer perform well, I first have to clean up the data so that only the "meaningful" part is handed to the transformer,
e.g. this
ETG SOME R-STR.27 NK SEPA-ÜBERWEISUNG IBAN+ DE12345678901234567890 BIC+ ACMEBICNAME SVWZ+ Utility costs 2024 John Doe KREF+ ACMEBICNAME1234567890
will be turned into
Utility costs 2024 John Doe
I've written a small regexp-based Python class for this.
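Roughly along these lines (a simplified sketch; my real class handles more SEPA field tags than shown here):

import re

class PurposeCleaner:
    # SEPA purpose strings consist of tagged fields (IBAN+, BIC+, SVWZ+,
    # KREF+, ...); only SVWZ+ (the remittance information) carries the
    # meaningful free text, so extract that and drop the rest.
    _SVWZ = re.compile(r"SVWZ\+\s*(.*?)(?:\s+[A-Z]{3,4}\+|$)")

    @staticmethod
    def clean_purpose(text: str) -> str:
        match = PurposeCleaner._SVWZ.search(text)
        return match.group(1).strip() if match else text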
Now my question is: can I reasonably include the cleanup step in my spaCy pipeline?
The reason is that the real source data looks like the first example, and if the cleanup is not part of the pipeline, I have to run it separately on the training data and again before inference.
I tried building a pipeline component like this:
from spacy.language import Language
from spacy.tokens import Doc

from api.utils.purpose_cleaner import PurposeCleaner


@Language.factory("purpose_cleaner")
def create_purpose_cleaner(nlp: Language, name: str):
    def purpose_cleaner_component(doc: Doc) -> Doc:
        text = doc.text
        cleaned = PurposeCleaner.clean_purpose(text)
        # Use nlp.make_doc to re-tokenize the cleaned text
        return nlp.make_doc(cleaned)

    return purpose_cleaner_component
Then I integrated it into the training config:
[nlp]
pipeline = ["purpose_cleaner", "transformer", "textcat_multilabel"]
[components]
[components.purpose_cleaner]
factory = "purpose_cleaner"
...
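Calling the pipeline directly, the component itself does what I want (a quick sanity check; this assumes the module defining the factory has been imported first so "purpose_cleaner" is registered):

import spacy

nlp = spacy.blank("de")
nlp.add_pipe("purpose_cleaner")

doc = nlp(
    "ETG SOME R-STR.27 NK SEPA-ÜBERWEISUNG IBAN+ DE12345678901234567890 "
    "BIC+ ACMEBICNAME SVWZ+ Utility costs 2024 John Doe KREF+ ACMEBICNAME1234567890"
)
print(doc.text)  # -> "Utility costs 2024 John Doe"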
But running spacy train with this config results in
⚠ Aborting and saving the final best model. Encountered exception:
ValueError("[E949] Unable to align tokens for the predicted and reference docs.
It is only possible to align the docs when both texts are the same except for
whitespace and capitalization.")
Which makes sense: the cleaner rewrites the predicted doc, but the reference docs still contain the raw, uncleaned text, so spaCy can no longer align the tokens of the two.
What is the best advice to solve this? Running the cleanup on the training data and again before inference is the solution I currently have.
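In case it helps, this is roughly what that workaround looks like (a sketch; the file names and the "text" field are just placeholders for my actual setup):

import srsly

from api.utils.purpose_cleaner import PurposeCleaner

# Clean the raw texts once, before building the training corpus.
records = srsly.read_jsonl("raw_purposes.jsonl")
cleaned = (
    {**rec, "text": PurposeCleaner.clean_purpose(rec["text"])}
    for rec in records
)
srsly.write_jsonl("cleaned_purposes.jsonl", cleaned)

# ...and apply the same cleaner again before inference:
def classify(nlp, raw_text: str) -> dict:
    doc = nlp(PurposeCleaner.clean_purpose(raw_text))
    return doc.cats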