Custom Logger Function in spaCy v3.8.14

Hello,

I am updating a custom NER model from v3.6.1 to v3.8.14 (the latest version at the time of writing). I am reusing some samples originally labeled with Prodigy.

To do so, I am using a previously built GCP Pipeline (former Vertex AI Pipeline), which at some point requires me to create a custom logger. My current version looks like this:

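import json
import sys
from pathlib import Path
from typing import IO, Any, Callable, Dict, Optional, Tuple

import spacy
from spacy.language import Language

@spacy.registry.loggers("spacy_history_logger.v1")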
def custom_logger(log_path):
    def setup_logger(
        nlp: Language,
        stdout: IO=sys.stdout,
        stderr: IO=sys.stderr
    ) -> Tuple[Callable, Callable]:
        stdout.write(f"Logging to {log_path}\n")
        log_file = Path(log_path).open("w", encoding="utf8")
        
        def log_step(info: Optional[Dict[str, Any]]):
            if info:

                to_write = {
                    'epoch': info['epoch'],
                    'step': info['step'],
                    'score': info['score'],
                    'loss_ner': info['losses']['ner'],
                    'f1_score': info['other_scores']['ents_f']
                }
                
                log_file.write(json.dumps(to_write))
                log_file.write("\n")

        def finalize():
            log_file.close()

        return log_step, finalize

    return setup_logger

The last time I used this pipeline (about 2 years ago) it worked fine, but it now raises an error when serializing the 'to_write' values:

TypeError: Object of type float32 is not JSON serializable
File "/logger.py", line 30, in log_step
    log_file.write(json.dumps(to_write))

Hence, I have a really simple but decisive question: why does this pipeline work fine with spaCy v3.6.1 but fail with v3.8.14? What changed regarding loggers?

Thank you

Hi @dave-espinosa,

Apologies for the delay!
Nothing about the logger registry/contract changed. What changed is the type of the values you're reading out of the info dict.

In your dict:

  • info["losses"]["ner"]: comes from the trainable pipe's update(), which assigns from numpy/Thinc ops (e.g. losses[self.name] += loss in spacy/pipeline/transition_parser.pyx:524). Between v3.7 and v3.8 (the Thinc / Cython 3.0 migration in support of Python 3.13), these values started being propagated as numpy.float32 rather than being implicitly unwrapped to Python float.
  • info["other_scores"]["ents_f"]: comes from Scorer.score_spans and is now likewise a numpy scalar.

In v3.6.1 you usually got native Python floats back (Cython 0.29 / older Thinc), so json.dumps just worked. In v3.8.x they're np.float32, and the stdlib json encoder doesn't know how to serialize numpy scalars, hence TypeError: Object of type float32 is not JSON serializable.
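
The failure can be reproduced outside spaCy. Below is a minimal sketch, assuming numpy is installed; the try_dump helper is hypothetical and only there for illustration:

```python
import json

import numpy as np


def try_dump(value):
    """Return the JSON text for value, or the TypeError message if encoding fails."""
    try:
        return json.dumps({"loss_ner": value})
    except TypeError as err:
        return f"TypeError: {err}"


# numpy.float32 is not a subclass of Python's float, so the stdlib encoder rejects it.
failed = try_dump(np.float32(0.5))

# Casting to a built-in float first makes the same value serializable.
ok = try_dump(float(np.float32(0.5)))

print(failed)
print(ok)
```

The first call should report the same "not JSON serializable" error as in your traceback, while the second produces a normal JSON object.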

If you look at the built-in console_logger_v3 at spacy/training/loggers.py:116-161, it explicitly casts every value: float(info["losses"][pipe_name]), float(score), float(info["score"]).

A minimal patch to your logger would then be:

  def log_step(info):
      if info:
          to_write = {
              "epoch": info["epoch"],
              "step": info["step"],
              "score": float(info["score"]),
              "loss_ner": float(info["losses"]["ner"]),
              "f1_score": float(info["other_scores"]["ents_f"]),
          }
          log_file.write(json.dumps(to_write))
          log_file.write("\n")

It will keep working regardless of whether spaCy hands you Python floats or numpy scalars in future versions.
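
If you later log more fields and would rather not cast each one by hand, one possible alternative is a fallback passed to json.dumps via its default parameter. This is only a sketch and not something the spaCy logger contract prescribes; to_builtin and dump_line are hypothetical names, and it relies on the assumption that the values are scalar-like objects exposing .item(), as numpy scalars do:

```python
import json


def to_builtin(obj):
    """Fallback for json.dumps: unwrap scalar-like objects that expose .item()."""
    # numpy scalars (e.g. numpy.float32) provide .item(), which returns the
    # equivalent built-in Python value. json.dumps only calls this function
    # for objects it cannot serialize on its own.
    item = getattr(obj, "item", None)
    if callable(item):
        return item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def dump_line(record):
    """Serialize one log record as a single JSON line."""
    return json.dumps(record, default=to_builtin) + "\n"
```

Inside log_step you would then call log_file.write(dump_line(to_write)). The explicit float(...) casts above remain the simpler option when the set of fields is small and fixed.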

Hello @magdaaniol ,

I came to the same conclusion; thank you for confirming it. That worked. Before receiving your reply, I had also added:

"epoch": int(info["epoch"]),
"step": int(info["step"]),

as an additional safety measure.

BR,

Dave