Using the output of ner.gold-to-spacy to train a new model

Hello,
As mentioned in the thread "ner.batch-train is really slow", I am not able to train the model via ner.batch-train.

For the moment I have exported my dataset via ner.gold-to-spacy, but the output does not seem to be compatible with https://spacy.io/api/annotation#json-input.

I have two questions:

  1. How can I use this file to train my model? Can I not use it via a CLI command?
  2. I am using the same training corpus to train my custom entities, and each dataset has its own custom NER entity. Example:

I was born in Rome on 1985-01-01

I annotate 1985-01-01 as DATE in the DATE dataset and Rome in the CITY dataset.

Is this a problem? I mean, spaCy will see the same sentence once with one annotation and then again with a different one.

Thanks

Hi,

Sorry I missed this thread before. I’ve been writing about the same sort of question in this thread: Remarkable Difference Between Prodigy and Custom Training Times

There can be a problem here, yes, but we can take steps to solve it. For a start, you can use the prodigy.models.ner.merge_spans() function to group the annotations onto the same sentence. You should concatenate your datasets and pass them through this function, and then use the ner.print-dataset recipe to check that the results are correct. Next, you can pass your annotations through the ner.make-gold recipe, so that you can manually add any entities that are still missing. This should let you create a dataset you can use in spaCy or another NER tool.
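
To make the merging step concrete, here is a minimal sketch in plain Python of what grouping annotations onto the same sentence could look like. It does not call Prodigy's merge_spans(); the helper name merge_annotations_by_text is hypothetical, and the record layout (a "text" key, a "spans" list with "start", "end" and "label", and an "answer" key) is an assumption about how the exported examples are shaped.

```python
def merge_annotations_by_text(examples):
    """Hypothetical helper: union the spans of all accepted examples
    that share the same text. Not Prodigy's own merge_spans()."""
    merged = {}
    for eg in examples:
        # Assumption: examples without an "answer" key count as accepted.
        if eg.get("answer", "accept") != "accept":
            continue
        spans = merged.setdefault(eg["text"], set())
        for span in eg.get("spans", []):
            spans.add((span["start"], span["end"], span["label"]))
    return [
        {
            "text": text,
            "spans": [
                {"start": start, "end": end, "label": label}
                for start, end, label in sorted(spans)
            ],
        }
        for text, spans in merged.items()
    ]


text = "I was born in Rome on 1985-01-01"
# One record per dataset, each carrying only its own entity type.
date_dataset = [
    {"text": text, "spans": [{"start": 22, "end": 32, "label": "DATE"}], "answer": "accept"}
]
city_dataset = [
    {"text": text, "spans": [{"start": 14, "end": 18, "label": "CITY"}], "answer": "accept"}
]

merged = merge_annotations_by_text(date_dataset + city_dataset)
```

After merging, the sentence appears once with both the CITY and the DATE span, so the model no longer sees the same text with conflicting annotations. Entities that nobody annotated in either dataset are still missing at this point, which is why the manual correction pass is the next step.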


The ner.gold-to-spacy recipe currently only exports the annotations to character offset or BILUO format. It doesn’t yet reconcile spans referring to the same input hash. Adding an option to make it output a full JSON file for training with spacy train is a good idea, though!
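
As a rough illustration of the character offset format, the sketch below turns records with spans into (text, {"entities": [(start, end, label)]}) tuples, which is the shape spaCy v2's simple training examples use. The helper name to_offset_format is hypothetical, the record layout is an assumption, and whether the recipe's export matches this shape exactly should be checked against your own output file.

```python
def to_offset_format(records):
    """Hypothetical helper: convert records with "text" and "spans"
    into (text, {"entities": [(start, end, label), ...]}) tuples."""
    train_data = []
    for rec in records:
        entities = sorted(
            (span["start"], span["end"], span["label"])
            for span in rec.get("spans", [])
        )
        # Overlapping entity spans cannot be trained on directly,
        # so fail loudly instead of silently keeping them.
        for (_, prev_end, _), (start, _, _) in zip(entities, entities[1:]):
            if start < prev_end:
                raise ValueError("Overlapping spans in: %r" % rec["text"])
        train_data.append((rec["text"], {"entities": entities}))
    return train_data


records = [
    {
        "text": "I was born in Rome on 1985-01-01",
        "spans": [
            {"start": 22, "end": 32, "label": "DATE"},
            {"start": 14, "end": 18, "label": "CITY"},
        ],
    }
]
train_data = to_offset_format(records)
```

Tuples of this shape can be fed to a custom spaCy v2 training loop. They are not the JSON input expected by the spacy train command, which is the gap discussed in this thread.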


An option to create the full JSON file for training with spacy train would be excellent. I’ve been really confused about how to generate that format.
