Updating a spaCy NER model

Hi!

I have a question about updating a spaCy NER model.

I did the annotation in Prodigy and trained a first version of a model with Prodigy.
(I exported the annotations using db-out.)
I used data-to-spacy to prepare the files and created the spaCy model. Now I want to know how to update my model by adding new annotations. How do I create a corpus that includes both the old and the new annotations to train a new model?

Because the spaCy training data uses another file type / format:

("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),

And Prodigy generated a different format:

{"text":"Contrat de travail \u00e0 dur\u00e9e d\u00e9termin\u00e9e","_input_hash":-1746332190,"_task_hash":1801907309,"tokens":[{"text":"Contrat","start":0,"end":7,"id":0,"ws":true},{"text":"de","start":8,"end":10,"id":1,"ws":true},{"text":"travail","start":11,"end":18,"id":2,"ws":true},{"text":"\u00e0","start":19,"end":20,"id":3,"ws":true},{"text":"dur\u00e9e","start":21,"end":26,"id":4,"ws":true},{"text":"d\u00e9termin\u00e9e","start":27,"end":37,"id":5,"ws":false}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":37,"token_start":0,"token_end":5,"label":"TYPECONTRAT"}],"answer":"accept"}

I tried to update the model using only the new data, and it seems like my model trained only on the new data and ignored the old data.

Thank you !

Hi! There are two ways you could do this: One would be to convert your old annotations to a .spacy file and then use data-to-spacy in Prodigy to export a .spacy file from your annotations. You can then train your model from a directory with both files in it.
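For the first option, a minimal sketch of converting old (text, {"entities": ...}) tuples into a .spacy file, assuming your old data looks like the tuple example above (the file name old_annotations.spacy is just an illustration):

```python
import spacy
from spacy.tokens import DocBin

# Hypothetical old annotations in spaCy's (text, {"entities": ...}) tuple format
old_examples = [
    ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
]

nlp = spacy.blank("en")  # use the same language as your pipeline
doc_bin = DocBin()
for text, annots in old_examples:
    doc = nlp(text)
    ents = []
    for start, end, label in annots["entities"]:
        # char_span returns None if the offsets don't align with token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None:
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)
doc_bin.to_disk("old_annotations.spacy")
```

You can then put old_annotations.spacy in the same directory as the file produced by data-to-spacy and train from that directory.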

The other option would be to convert your previous data to Prodigy's format – this should be pretty easy to do because it already has the character offsets, which is exactly what Prodigy stores in the "spans". So you could do something like:

import spacy
from prodigy.components.preprocess import add_tokens

new_examples = []
for text, annots in your_old_examples:
    spans = [{"start": start, "end": end, "label": label} for start, end, label in annots["entities"]]
    eg = {"text": text, "spans": spans, "answer": "accept"}
    new_examples.append(eg)

nlp = spacy.blank("en")
examples = list(add_tokens(nlp, new_examples))

You can then either save examples to a file and import it to Prodigy, or add it to the database straight away in Python.
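For the file route, a sketch of writing the converted examples as JSONL (one JSON object per line, which is what Prodigy imports; the file name is made up):

```python
import json

# Hypothetical converted examples in Prodigy's task format
examples = [
    {"text": "Walmart is a leading e-commerce company",
     "spans": [{"start": 0, "end": 7, "label": "ORG"}],
     "answer": "accept"},
]

with open("old_annotations.jsonl", "w", encoding="utf-8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")
```

You can then import the file with db-in, e.g. prodigy db-in your_dataset old_annotations.jsonl (the dataset name is up to you).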

Hi Ines! :slight_smile:

Thanks for your reply!

Hi! There are two ways you could do this: One would be to convert your old annotations to a .spacy file and then use data-to-spacy in Prodigy to export a .spacy file from your annotations. You can then train your model from a directory with both files in it.

Our data is already in .spacy format. We would like to know: can we merge an unlimited number of .spacy files? Because our model will be updated very often.

Each time we have new data to add, we will convert it to .spacy and add it to the directory where the other .spacy files are. That is how we thought we would proceed.

The other option would be to convert your previous data to Prodigy's format – this should be pretty easy to do because it already has the character offsets, which is exactly what Prodigy stores in the "spans". So you could do something like

Is there a simple way to convert our .json file from Prodigy to the span format like in your example?

Thanks!

Yes, you can just provide a directory of .spacy files when you train so it's no problem to keep adding files to it. If you're already storing your data in .spacy, that's definitely the easiest solution.
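Training accepts the directory directly, so merging isn't required. But if you ever want to consolidate many .spacy files into one, DocBin can merge them. A sketch with made-up file names (the first loop only creates sample files for illustration):

```python
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
corpus_dir = Path("corpus")  # hypothetical directory holding all .spacy files
corpus_dir.mkdir(exist_ok=True)

# For illustration only: write two small .spacy files, as if exported at different times
for name, text in [("batch1.spacy", "First example."), ("batch2.spacy", "Second example.")]:
    db = DocBin()
    db.add(nlp(text))
    db.to_disk(corpus_dir / name)

# Merge every .spacy file in the directory into one corpus file
merged = DocBin()
for path in sorted(corpus_dir.glob("*.spacy")):
    merged.merge(DocBin().from_disk(path))
merged.to_disk("merged.spacy")
```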

If you want to convert Prodigy annotations to .spacy files, you can use the data-to-spacy command: https://prodi.gy/docs/recipes/#data-to-spacy

Hi Ines, :slight_smile:

The data-to-spacy recipe converts the data into .spacy, which is a binary file that works with the first method you gave me. But I need to work with the second method, so I need this format:

("Walmart is a leading e-commerce company", [(0, 7, "ORG"), (8, 10, "ORG2")]),
("I reached Chennai yesterday.", [(10, 17, "GPE")]),

Because the JSON file exported by Prodigy contains a lot of information that is useless for us: we only need the text and the corrected entities, but the file contains much more that we are not interested in.
Here is an example of the JSON we have:

{"text":"Contrat de travail \u00e0 dur\u00e9e d\u00e9termin\u00e9e","_input_hash":-1746332190,"_task_hash":1801907309,"tokens":[{"text":"Contrat","start":0,"end":7,"id":0,"ws":true},{"text":"de","start":8,"end":10,"id":1,"ws":true},{"text":"travail","start":11,"end":18,"id":2,"ws":true},{"text":"\u00e0","start":19,"end":20,"id":3,"ws":true},{"text":"dur\u00e9e","start":21,"end":26,"id":4,"ws":true},{"text":"d\u00e9termin\u00e9e","start":27,"end":37,"id":5,"ws":false}],"_session_id":null,"_view_id":"ner_manual","spans":[{"start":0,"end":37,"token_start":0,"token_end":5,"label":"TYPECONTRAT"}],"answer":"accept"}

It does not contain only (text, [(start, end, label)]), and I need to turn this complicated JSON format into

("TEXT.", [(start, end, "Label")]),

because we have a big corpus and I can't manually extract only the information we need.
Is there a command to generate a simple JSON from Prodigy in this format

("TEXT.", [(start, end, "Label")]),

or some code that reduces our complicated JSON to only this information:

("TEXT.", [(start, end, "Label")]),

Thank you :slight_smile: !

Why do you need to create the data from the JSON instead of using data-to-spacy? The thing about data-to-spacy is that it also does some other useful things: it combines all annotations on the same example, so you can create a corpus from annotations for multiple components and merge multiple labels. It also gives you more useful debugging information. So we'd always recommend using that.

There's no command, but it's a pretty simple transformation:

training_data = []
for eg in examples:
    spans = [(span["start"], span["end"], span["label"]) for span in eg.get("spans", [])]
    training_data.append((eg["text"], spans))
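Wrapped up as a small helper that reads db-out JSONL lines and keeps only accepted answers (a sketch; the function name is made up):

```python
import json

def prodigy_to_tuples(jsonl_lines):
    """Convert Prodigy db-out JSONL lines to (text, [(start, end, label)]) tuples."""
    data = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue  # skip rejected/ignored annotations
        spans = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        data.append((eg["text"], spans))
    return data
```

You could call it with open("annotations.jsonl") as the argument, since a file object iterates line by line.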

Thank you for the details, I understand better how to proceed now!