As mentioned in this thread: ner.batch-train is really slow — I am not able to train the model via ner.batch-train.
For now I have exported my dataset via ner.gold-to-spacy, but the output does not seem to be compatible with https://spacy.io/api/annotation#json-input.
I have two questions:
- How can I use this file to train my model? Can't I use it via a CLI command?
- I am using the same training corpus to train my custom entities, and each dataset has its own custom NER entity type. For example:
I was born in Rome on 1985-01-01
I annotate 1985-01-01 as DATE in the DATE dataset, and Rome as CITY in the CITY dataset.
Is this a problem? I mean, spaCy will see the same sentence once with ONE annotation and again with a different one.
Sorry I missed this thread before. I’ve been writing about the same sort of question in this thread: Remarkable Difference Between Prodigy and Custom Training Times
There can be a problem here, yes, but we can take steps to solve it. For a start, you can use the prodigy.models.ner.merge_spans() function to group the annotations onto the same sentence. You should concatenate your datasets and pass them through this function, and then use the ner.print-dataset recipe to check that the results are correct. Next, you can pass your annotations through the ner.make-gold recipe, so that you can manually correct any missing entities. This should let you create a dataset you can use in spaCy or another NER tool.
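To make the merging step concrete, here is a small self-contained sketch of the idea behind it: group examples that share the same underlying input (Prodigy's _input_hash) and take the union of their spans. This is an illustration of the concept, not the actual merge_spans() implementation — the toy records and the helper function are my own assumptions about the JSONL-style task format.

```python
from collections import OrderedDict

# Toy examples in a Prodigy-like JSONL task format: the same sentence
# annotated once in a DATE dataset and once in a CITY dataset.
# "_input_hash" stands in for Prodigy's input hash (assumed layout).
date_dataset = [
    {"_input_hash": 1, "text": "I was born in Rome on 1985-01-01",
     "spans": [{"start": 22, "end": 32, "label": "DATE"}]},
]
city_dataset = [
    {"_input_hash": 1, "text": "I was born in Rome on 1985-01-01",
     "spans": [{"start": 14, "end": 18, "label": "CITY"}]},
]

def merge_by_input_hash(examples):
    """Group examples on the same input text and union their spans —
    the same idea that prodigy.models.ner.merge_spans() implements."""
    merged = OrderedDict()
    for eg in examples:
        key = eg["_input_hash"]
        if key not in merged:
            merged[key] = {"text": eg["text"], "spans": []}
        # De-duplicate spans so overlapping datasets don't add repeats.
        seen = {(s["start"], s["end"], s["label"])
                for s in merged[key]["spans"]}
        for span in eg.get("spans", []):
            if (span["start"], span["end"], span["label"]) not in seen:
                merged[key]["spans"].append(span)
    return list(merged.values())

merged = merge_by_input_hash(date_dataset + city_dataset)
# One merged example carrying both the DATE and the CITY span.
print(merged)
```

After merging like this, each sentence carries all of its annotations at once, which is exactly what you want before correcting it in ner.make-gold.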
The ner.gold-to-spacy recipe currently only exports the annotations in character-offset or BILUO format – it doesn't yet reconcile spans referring to the same input hash. Adding an option to make it output a full JSON file for training with spacy train is a good idea, though!
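In the meantime, the character-offset export can be reshaped by hand into spaCy v2's simple training format — a list of (text, {"entities": [(start, end, label), ...]}) tuples that a training loop can pass to nlp.update(). A minimal sketch; the sample line below is my assumption of what an offset export looks like, so check it against your actual file:

```python
import json

# Hypothetical JSONL lines from a character-offset export (assumed layout:
# one [text, {"entities": [[start, end, label], ...]}] pair per line).
exported_lines = [
    '["I was born in Rome on 1985-01-01", '
    '{"entities": [[22, 32, "DATE"], [14, 18, "CITY"]]}]',
]

# Reshape into spaCy v2's simple training format:
# (text, {"entities": [(start, end, label), ...]}) tuples.
TRAIN_DATA = []
for line in exported_lines:
    text, annots = json.loads(line)
    entities = [tuple(ent) for ent in annots["entities"]]
    TRAIN_DATA.append((text, {"entities": entities}))

print(TRAIN_DATA)
```

These tuples can then be fed to a standard spaCy v2 training loop (nlp.update() on a model with an "ner" pipe), rather than going through the JSON corpus format at all.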
An option to create the full JSON file for training with spacy train would be excellent. I've been really confused about how to generate that format.