Converting SpaCy training json file to Prodigy jsonl format

ines · July 17, 2018, 10:32am

There’s currently no built-in converter, but it’s definitely on our list. We’re also hoping that the planned corpus management tool will make it easier to unify the formats, because you’ll be storing all your annotations and training corpora in the same place, and it’ll just natively integrate with spaCy, Prodigy and other tools.

But in the meantime, you should also be able to write a script that takes spaCy’s format, extracts only the text and the respective annotations and then converts the BILUO tags to offsets (spacy.gold has a helper function for that).

If you’ve already set up your training logic with spaCy and you only want the train-curve functionality, an easier way would probably be to just add it yourself. The logic itself isn’t so difficult – here are the basics of it:

factors = [(i + 1) / n_samples for i in range(n_samples)]
prev_acc = 0
for factor in factors:
    random.shuffle(examples)
    current_examples = examples[:int(len(examples) * factor)]
    # train model, compare to previous accuracy, output results etc.

Given a number of samples n_samples (e.g. 4 to run 4 iterations with 25/50/75/100), you first calculate a list of those factors, e.g. [0.25, 0.5, 0.75, 1.0]. For each factor, you then shuffle the examples and take a slice of them (the total number of examples times the factor).

If you store the previous accuracy, you can compare the new accuracy on each run and output the difference, to see how the accuracy is changing. You could also expand this and take other metrics into account (precision, recall), or even execute additional logic if the accuracy improvement exceeds a threshold.

If you only shuffle after taking a slice of the examples, you can measure how annotations that were added later influence the accuracy. This requires the examples to come in a meaningful order, though.

Topic		Replies	Views
Prodigy annotations to SpaCy train spacy	13	5616	January 31, 2018
unable to convert prodigy jsonl to spacy training json usage , spacy	3	1462	June 26, 2020
Prodigy ner.batch-train vs Spacy train usage , spacy , best-practices	13	3498	June 2, 2020
Create a dataset out of many txt_files documents (Best Practice) usage , ner , best-practices	4	1819	March 30, 2021
Training prodigy ner data through spacy usage , ner , spacy , solved	3	893	January 8, 2020

Converting SpaCy training json file to Prodigy jsonl format

Related topics