Converting spaCy training JSON file to Prodigy JSONL format

There’s currently no built-in converter, but it’s definitely on our list. We’re also hoping that the planned corpus management tool will make it easier to unify the formats, because you’ll be storing all your annotations and training corpora in the same place, and it’ll just natively integrate with spaCy, Prodigy and other tools.

In the meantime, you should be able to write a script that takes spaCy’s format, extracts only the text and the respective annotations, and then converts the BILUO tags to character offsets (spacy.gold has a helper function for that, offsets_from_biluo_tags, which takes a Doc and a list of tags).
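Here's a rough sketch of what such a script could look like. To keep it free of dependencies, it doesn't call the spaCy helper and instead aligns the tokens against the raw text itself. The key names ("paragraphs", "raw", "sentences", "tokens", "orth", "ner") are my assumption about how your spaCy JSON file is laid out, and biluo_to_spans, convert and to_jsonl are just illustrative names, so adjust them to your data:

```python
import json


def biluo_to_spans(text, tokens):
    """Turn (orth, biluo_tag) pairs into character-offset spans.

    Tokens are located in the text from left to right, so they have to
    appear in the text in order, exactly as written.
    """
    spans = []
    cursor = 0    # everything before this position is already matched
    start = None  # start offset of the entity currently being built
    for orth, tag in tokens:
        begin = text.index(orth, cursor)  # ValueError if text and tokens disagree
        end = begin + len(orth)
        cursor = end
        if tag.startswith(("B-", "U-")):
            start = begin
        if tag.startswith(("L-", "U-")) and start is not None:
            spans.append({"start": start, "end": end, "label": tag[2:]})
            start = None
        elif not tag.startswith(("B-", "I-")):
            start = None  # "O" or a missing tag closes any dangling entity
    return spans


def convert(spacy_docs):
    """Yield one {"text": ..., "spans": [...]} task per paragraph."""
    for doc in spacy_docs:
        for para in doc["paragraphs"]:
            tokens = [
                (token["orth"], token.get("ner", "O"))
                for sent in para["sentences"]
                for token in sent["tokens"]
            ]
            yield {"text": para["raw"], "spans": biluo_to_spans(para["raw"], tokens)}


def to_jsonl(tasks):
    """Serialize tasks as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(task) for task in tasks)
```

You'd then load your training file with json.load, pass the result to convert and write the output of to_jsonl to a .jsonl file.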

If you’ve already set up your training logic with spaCy and you only want the train-curve functionality, an easier way would probably be to just add it yourself. The logic itself isn’t so difficult – here are the basics of it:

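import random

n_samples = 4  # number of runs, e.g. 4 for 25/50/75/100% of the data
# examples is your list of annotated training examples
factors = [(i + 1) / n_samples for i in range(n_samples)]
prev_acc = 0
for factor in factors:
    random.shuffle(examples)
    current_examples = examples[:int(len(examples) * factor)]
    # train model, compare to previous accuracy, output results etc.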

Given a number of samples n_samples (e.g. 4 to run 4 iterations on 25%, 50%, 75% and 100% of the data), you first calculate a list of those factors, e.g. [0.25, 0.5, 0.75, 1.0]. For each factor, you then shuffle the examples and take a slice of them (the total number of examples times the factor, rounded down).

If you store the previous accuracy, you can compare the new accuracy on each run and output the difference, to see how the accuracy is changing. You could also expand this and take other metrics into account (precision, recall), or even execute additional logic if the accuracy improvement exceeds a threshold.
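To make this a bit more concrete, here's a minimal sketch of the whole loop including the comparison. train_and_evaluate is a hypothetical callback standing in for your existing spaCy training logic: I'm assuming it takes a list of examples, trains a fresh model and returns the accuracy as a float. The function name train_curve and the seed argument are also just my choices:

```python
import random


def train_curve(examples, train_and_evaluate, n_samples=4, seed=0):
    """Train on growing portions of the data and report the accuracy change.

    train_and_evaluate is a hypothetical callback: it receives a list of
    examples and returns the accuracy of a model trained on them.
    """
    rng = random.Random(seed)
    examples = list(examples)  # copy, so the caller's list is not reordered
    factors = [(i + 1) / n_samples for i in range(n_samples)]
    prev_acc = 0.0
    results = []
    for factor in factors:
        rng.shuffle(examples)
        current_examples = examples[:int(len(examples) * factor)]
        acc = train_and_evaluate(current_examples)
        diff = acc - prev_acc
        print(f"{factor:>4.0%}  {len(current_examples):>6} examples  "
              f"accuracy {acc:.3f}  ({diff:+.3f})")
        results.append((factor, len(current_examples), acc, diff))
        prev_acc = acc
    return results
```

Returning the results as tuples makes it easy to add your own checks afterwards, for instance stopping early or logging a warning if the difference on the last run drops below a threshold.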

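If you instead take the slice first and only shuffle afterwards, each run contains the first examples of your data set in their original order, so you can measure how annotations that were added later influence the accuracy. This requires the examples to come in a meaningful order, though, for example the order in which they were annotated.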
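As a sketch of that variant, assuming the list is already sorted in the order the annotations were created (ordered_slices is a made-up name):

```python
import random


def ordered_slices(examples, n_samples=4, seed=0):
    """Yield (factor, examples) pairs, slicing first and shuffling second.

    Each slice holds the earliest portion of the data, so later runs
    only ever add examples that were annotated later.
    """
    rng = random.Random(seed)
    for i in range(n_samples):
        factor = (i + 1) / n_samples
        current_examples = list(examples[:int(len(examples) * factor)])
        rng.shuffle(current_examples)  # only changes the order within the slice
        yield factor, current_examples
```

Each run then sees a superset of the previous run's examples, which isn't guaranteed if you shuffle the full list before slicing.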