Converting a spaCy training JSON file to Prodigy JSONL format

I have an existing NER and POS dataset that I have converted into the format used by the command line interface for training a spaCy model. I want to import this data to use with Prodigy for two reasons:

  1. I want to check the training curve (for example 25/50/75/100% of the data) and Prodigy has a nice function for this (does spaCy have this too?)
  2. If the quality of the model is still increasing between 75% and 100% of the data, I want to expand the training set.

Is there any way of converting the JSON file that I created with spaCy into the Prodigy JSONL format?

There’s currently no built-in converter, but it’s definitely on our list. We’re also hoping that the planned corpus management tool will make it easier to unify the formats, because you’ll be storing all your annotations and training corpora in the same place, and it’ll just natively integrate with spaCy, Prodigy and other tools.

But in the meantime, you should also be able to write a script that takes spaCy’s format, extracts only the text and the respective annotations and then converts the BILUO tags to offsets (spaCy has a helper function for that).
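A rough sketch of such a script, assuming spaCy v2's JSON training layout (documents, then paragraphs, then sentences, then tokens with `orth` and `ner` keys) and Prodigy's `text`/`spans` task format. To keep it dependency-free, the BILUO conversion is hand-rolled here instead of calling spaCy's helper. The token list carries no whitespace information, so the text is rebuilt by joining tokens with single spaces; that is an assumption and will not always reproduce the original spacing. All function names are made up for illustration.

```python
import json


def biluo_to_spans(words, tags):
    """Join words with single spaces and turn BILUO tags into character spans."""
    spans = []
    offset = 0    # character position of the current word in the joined text
    start = None  # character start of the multi-token entity being built
    for word, tag in zip(words, tags):
        end = offset + len(word)
        if tag.startswith("U-"):
            spans.append({"start": offset, "end": end, "label": tag[2:]})
        elif tag.startswith("B-"):
            start = offset
        elif tag.startswith("L-") and start is not None:
            spans.append({"start": start, "end": end, "label": tag[2:]})
            start = None
        # "I-" continues the open entity; "O" and "-" (missing) add nothing.
        offset = end + 1  # skip the joining space
    return " ".join(words), spans


def spacy_json_to_prodigy(docs):
    """Yield one Prodigy-style task per sentence of the spaCy training data."""
    for doc in docs:
        for paragraph in doc["paragraphs"]:
            for sentence in paragraph["sentences"]:
                words = [token["orth"] for token in sentence["tokens"]]
                tags = [token.get("ner", "O") for token in sentence["tokens"]]
                text, spans = biluo_to_spans(words, tags)
                yield {"text": text, "spans": spans}


def to_jsonl(tasks):
    """Serialise tasks as newline-delimited JSON, one task per line."""
    return "\n".join(json.dumps(task) for task in tasks)


if __name__ == "__main__":
    example = [{"id": 0, "paragraphs": [{"sentences": [{"tokens": [
        {"orth": "Apple", "ner": "U-ORG"},
        {"orth": "is", "ner": "O"},
        {"orth": "based", "ner": "O"},
        {"orth": "in", "ner": "O"},
        {"orth": "New", "ner": "B-GPE"},
        {"orth": "York", "ner": "L-GPE"},
    ]}]}]}]
    print(to_jsonl(spacy_json_to_prodigy(example)))
```

For a real file you would load the training data with `json.load` and write the result of `to_jsonl` to a `.jsonl` file. The POS tags are dropped in this sketch, since only the entity spans are converted.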

If you’ve already set up your training logic with spaCy and you only want the train-curve functionality, an easier way would probably be to just add it yourself. The logic itself isn’t so difficult – here are the basics of it:

import random

factors = [(i + 1) / n_samples for i in range(n_samples)]
prev_acc = 0
for factor in factors:
    random.shuffle(examples)  # reshuffle before taking each slice
    current_examples = examples[:int(len(examples) * factor)]
    # train model, compare to previous accuracy, output results etc.

Given a number of samples n_samples (e.g. 4 to run 4 iterations with 25/50/75/100), you first calculate a list of those factors, e.g. [0.25, 0.5, 0.75, 1.0]. For each factor, you then shuffle the examples and take a slice of them (the total number of examples times the factor).

If you store the previous accuracy, you can compare the new accuracy on each run and output the difference, to see how the accuracy is changing. You could also expand this and take other metrics into account (precision, recall), or even execute additional logic if the accuracy improvement exceeds a threshold.

If you only shuffle after taking a slice of the examples, you can measure how annotations that were added later influence the accuracy. This requires the examples to come in a meaningful order, though.
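Putting the points above together, a minimal runnable sketch could look like this. `train_and_evaluate` is a hypothetical callback standing in for your own spaCy training and evaluation code; it is assumed to take a list of examples and return a single accuracy number. The `shuffle_first` flag is also an invention of this sketch: set it to `False` to slice first and shuffle afterwards, so that later annotations only enter the larger samples.

```python
import random


def train_curve(examples, train_and_evaluate, n_samples=4, shuffle_first=True, seed=0):
    """Train on growing portions of the data and report the accuracy change.

    Returns a list of (factor, accuracy, difference to previous run) tuples.
    """
    rng = random.Random(seed)
    factors = [(i + 1) / n_samples for i in range(n_samples)]
    results = []
    prev_acc = 0.0
    for factor in factors:
        pool = list(examples)  # copy, so the caller's list keeps its order
        if shuffle_first:
            rng.shuffle(pool)
        subset = pool[: int(len(pool) * factor)]
        if not shuffle_first:
            rng.shuffle(subset)  # order within the slice only
        acc = train_and_evaluate(subset)
        results.append((factor, acc, acc - prev_acc))
        prev_acc = acc
    return results


if __name__ == "__main__":
    # Dummy evaluator for demonstration: "accuracy" grows with the sample size.
    def fake_train_and_evaluate(subset):
        return len(subset) / 100

    for factor, acc, diff in train_curve(list(range(100)), fake_train_and_evaluate):
        print(f"{factor:.0%}\t{acc:.2f}\t{diff:+.2f}")
```

The same loop could return precision and recall as well, if the callback reports them.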

I’m looking forward to the corpus management tool, and I can try to contribute if needed once the project is started.

I’ll add the train-curve functionality to the spaCy training logic to begin with. Thank you for the example code, tips and quick answer :slight_smile:

@ines I managed to split the data and run the models with 25, 50, 75 and 100% of the data. The results were a bit strange. When comparing the best model from 75% vs 100%, there is an increase of about 3 points (75.98 -> 79.2); however, when evaluating the models on the test data, there is little difference (68.51 -> 68.75).

Do you think I should consider creating additional data using Prodigy? The training set consists of about 15,000 sentences, and an F-score of 0.7 seems a bit low considering how much data there is. I have not personally gone through the data to verify its quality.

Can you recommend any other tests that can help me identify what the problem might be? Or maybe a way to test the quality a bit faster than reading through the conllu file?

I see that most of your NER models have around 0.82-0.88 in F-score. Do you think it is possible to increase the F-score from 0.7 to 0.8 by tweaking hyperparameters, or do you think the underlying data needs to improve?

@ohenrik It’s pretty hard to compare NER results across corpora. I think you might want to do some error analysis to understand your results better. A good check is to print out three columns: Entity, True Tags, Model Tags. Then you can pipe that through sort | uniq -c | sort -rn to find the most frequent entities, and how your model did on them.
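If you would rather stay in Python than pipe through the shell, a `collections.Counter` gives the same kind of frequency table. This is only a sketch: the spans are assumed to be `(start, end, label)` character offsets, entities are matched on exact boundaries (a partially overlapping prediction shows up as two separate rows), and the `TEAM` label and the sample sentences are made up.

```python
from collections import Counter


def entity_rows(text, gold_spans, pred_spans):
    """Yield (entity text, true label, model label) for one sentence.

    "-" marks an entity that is missing on the gold or the model side.
    """
    gold = {(start, end): label for start, end, label in gold_spans}
    pred = {(start, end): label for start, end, label in pred_spans}
    for start, end in sorted(set(gold) | set(pred)):
        yield (text[start:end], gold.get((start, end), "-"), pred.get((start, end), "-"))


def tally(sentences):
    """Count identical rows, most frequent first (like sort | uniq -c | sort -rn)."""
    counts = Counter()
    for text, gold_spans, pred_spans in sentences:
        counts.update(entity_rows(text, gold_spans, pred_spans))
    return counts.most_common()


# Hypothetical sample: (text, gold spans, model spans)
sentences = [
    ("Australia beat India", [(0, 9, "TEAM"), (15, 20, "TEAM")], [(0, 9, "GPE"), (15, 20, "TEAM")]),
    ("Australia won", [(0, 9, "TEAM")], [(0, 9, "GPE")]),
]

if __name__ == "__main__":
    for (entity, true_label, model_label), count in tally(sentences):
        print(f"{count:5d} {entity}\t{true_label}\t{model_label}")
```

Rows where the true and model labels differ for a frequent entity are the ones worth looking at first.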

The most important thing for NER performance is how ambiguous the entities are. If you have common entities that have been annotated to have type ambiguities, that’ll really kill your performance. For instance, let’s say you’re annotating sporting text and you have ‘Australia’ sometimes as a country, but sometimes as whatever sports team. That’ll be really difficult for the model. Another example is ‘Trump’ as the person, vs ‘Trump’ as the business. Sometimes the context won’t really make this clear, so the model will struggle.

If the model mostly has to memorise direct word-to-tag mappings, it’ll perform pretty well even with not so much data. Similarly, if casing is a good clue as it is in English, baseline performance will be pretty decent.

The big difference in accuracy between your test data and your development data is also pretty concerning. If you’re using a random split from the training set, the question is whether the data is fully annotated, or whether it’s the sparse annotations as per ner.teach. Now that you have a reasonably sized data set, I would definitely want to have fully annotated, static datasets for both development and testing. Then you would pass the development data to the train-curve command, with the -es argument.

You usually want to create both your development and test sets at once, using exactly the same methodology. Then you randomly shuffle them and split them in two. This way you’re drawing the two sets from the same distribution.
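The shuffle-and-split step could be sketched as follows. The function name and the fixed seed are assumptions of this example; the seed is only there to make the split reproducible.

```python
import random


def split_dev_test(examples, seed=0):
    """Shuffle a pool of annotated examples and split it into two halves."""
    shuffled = list(examples)  # copy, so the original order is left untouched
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    dev, test = split_dev_test(range(10))
    print(len(dev), len(test))
```

Because both halves come from one shuffled pool, they are drawn from the same distribution.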

The logic here is that you want the development set to closely match the accuracy on the test set, unless your hyper-parameter search overfit. If your development set and your test set differ too much, then you don’t really know what to think when the two accuracies differ. Did you over-fit? Who knows?

Thanks for some good advice @honnibal! I’ll get to work on debugging this.

Regarding the sort command example: I’m not sure what you have in mind, as I do not have any files that I can sort this way. Is it just meant as an abstract example? I haven’t yet run the normal Prodigy train-curve command, as I do not have the data converted for use with Prodigy yet. I’m currently just creating a manual version using spaCy.

I can imagine one approach where I just load the training set, run the model on each sentence and then create statistics about the performance relative to each unique true tag using pandas or something similar. (Let me know if I’m completely off track here.)

I’m not sure what you mean by sparse here. Do you mean that each example only has one tag? So that a sentence like “After I won the lottery I bought a new iPhone and a Toyota Prius” would create two examples with only one entity marked in each example? I think the data I have has this sentence as one example with two entities marked.

I can dig deeper into the data and see if there are many examples of untagged entities. Should I try to remove as many of these errors as possible?

There are definitely sentences that do not contain any possible tags (should I remove these?), e.g. “He went to the store”.

The dataset I use came already split into 3 sets (train, dev and test), but I can merge them, re-split them and see if that changes anything.

Also, one last thing: is there any simple way to prioritize recall over precision while training the models using spaCy or Prodigy?

The sort command was supposed to be run on the output of the print statements I mentioned. Basically you’d generate a list of the entities and their taggings and then pipe it through sort and uniq to get the frequencies.

Yeah sure, computing the statistics with pandas is the same sort of thing. Pandas wasn’t around when I started doing these things and I’ve still never really found it better than the way I was already doing things, so I don’t tend to use it. I guess also because I run from the command line, not from Jupyter notebooks.

If you know you have all the entities marked in your data, then I would call that “dense” or “fully annotated”. In contrast it’s possible to have data where you have accepts and rejects, but you don’t necessarily know the gold-standard. The fully annotated data is easier to reason about, while the sparse data is quicker to create.

To find and correct untagged entities, you could use the ner.make-gold recipe.

No, don’t remove the sentences without entities. Those are useful examples, as they stop the model from making potential mistakes.

For prioritizing recall over precision, there are some possibilities, but none that are very simple. I suppose the best would be to use beam parsing.

Thank you for clarifying :slight_smile:

I managed to re-split the dataset as you explained, and that improved the results significantly. The best iteration had 85.163 in NER F, and the evaluation of that model at the end had nearly the same F-score :slight_smile: I’ll double check today that I haven’t made any mistakes, but it seems like the earlier problem might have been related to how the data was split and randomized.

These are the results from the evaluation:

(spacy_tranining) ➜  spacy_traning python -m spacy evaluate model_out8/model14 ner_data_resplit/no-ud-test-ner.json


    Time               9.70 s         
    Words              38127          
    Words/s            3932           
    TOK                100.00         
    POS                95.56          
    UAS                88.90          
    LAS                86.24          
    NER P              85.36          
    NER R              86.11          
    NER F              85.73  

I can share the model and dataset once they are officially released by the university (hopefully at the latest by the end of the summer) :tada: