Does Prodigy load pre-annotated data?

I am trying to find new entities in a very narrow domain. I have a list of phrases which I found in the text using the spaCy PhraseMatcher, so I now have the text, the token numbers and their labels. I want Prodigy to start from my labels instead of suggesting spans from given seed terms, and I want it to learn only from the changes I make in the interface. Is that possible?

My labels will look like this:

{
"text":"Pumped 18 M3 1,45 SG EZ spot pill, displaced same with 18 M3 1,4 SG mud. Spotted 10,3 M3 around BHA, left 7,7 M3 inside pipe.",
"spans":[
	{"start":24,"end":33,"token_start":6,"token_end":7,"label":"EQUIPMENT"},
	{"start":120,"end":124,"token_start":28,"token_end":28,"label":"EQUIPMENT"}
	]
}

Can Prodigy start with this, let me make changes, and learn from there?

Hi! It looks like your data already has the right format, so you could import it into Prodigy using the db-in command and pre-train a model using ner.batch-train. (You can also do this in spaCy if you prefer.) Once you have a pre-trained model, you can load it into the ner.teach recipe and start improving it.

Here’s an example workflow:

# import your data into Prodigy
prodigy db-in your_dataset /path/to/your_data.jsonl

# pre-train a model with your data
prodigy ner.batch-train your_dataset en_core_web_sm --output /path/to/output-model --no-missing

# improve the model interactively
prodigy ner.teach your_new_dataset /path/to/output-model --label EQUIPMENT

The --no-missing flag on ner.batch-train tells spaCy that the "spans" labelled in your data are complete and that there are no missing values, i.e. that you've annotated all entities that exist in the data. If that's not the case, you can always load your pre-annotated data into the ner.manual recipe to edit and correct it and add more labels.
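
For example, to load your pre-annotated data into the manual interface and correct it (the dataset name and file path here are just placeholders):

# edit and correct the pre-annotated data manually
prodigy ner.manual corrected_dataset en_core_web_sm /path/to/your_data.jsonl --label EQUIPMENT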

The model you export after training can then be used as the base model for ner.teach. Prodigy will then start suggesting more EQUIPMENT entities, focusing on the ones it's most uncertain about. After doing some annotation, you can re-run ner.batch-train without --no-missing (because you're then only training from binary annotations) and see if the model improves. How well this works and how many annotations you need depends on the amount of initial training examples, and on how easy it is for the model to generalise.
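
For example, using the annotations collected with ner.teach and the previously exported model as the base (dataset names and paths are placeholders):

# update the pre-trained model with the binary annotations
prodigy ner.batch-train your_new_dataset /path/to/output-model --output /path/to/improved-model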

Hi Ines,
Thank you for the detailed response. This is very helpful.

In case I have multiple labels in the future, should I run it with the --no-missing parameter?

If your annotations cover all entities that are present in the text, then yes. Otherwise, or if your data only contains binary yes/no annotations on single spans, then you can leave out the --no-missing flag.

It's also usually a good idea to have a dedicated evaluation set, so you can see if your model is improving and learning the right thing. You can use the --eval-id setting on ner.batch-train to specify the name of the evaluation dataset.
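
For example, assuming you've already imported your held-out examples into a separate dataset (the dataset names here are placeholders):

# train and evaluate against a dedicated evaluation dataset
prodigy ner.batch-train your_dataset en_core_web_sm --output /path/to/output-model --eval-id your_eval_dataset --no-missing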

Hi Ines:
Thank you very much. I could load my pre-labeled data and start labeling (yay!). When I try to train using ner.batch-train, a couple of things happen:

  1. In the accuracy numbers, I am getting 0 correct and everything else incorrect. Is it because I do not have 'reject' examples? I do get an accuracy of 0.6 in the first iteration, though.
  2. Python quits unexpectedly in the second (or first) iteration (could it be a problem with system capacity?), so the training does not complete.

Yes, this is possible. Prodigy's regular training mode is optimised to train from binary annotations and sparse labels, so you usually want examples of correct and incorrect spans. You can easily bootstrap those examples, though – either by randomly creating wrong annotations, or by doing a round of ner.teach and accepting / rejecting the model's suggestions.
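
For the first option, a quick standalone script along these lines could work (this is just a sketch, not a built-in Prodigy helper; the file paths are placeholders, and the randomly placed spans aren't snapped to token boundaries, so you may want to refine how the "wrong" offsets are picked):

import json
import random

def make_reject_examples(path_in, path_out):
    # For every labelled span, write out a copy of the example with the span
    # moved to a random (wrong) position in the same text, marked as "reject".
    with open(path_in, encoding="utf8") as f_in, open(path_out, "w", encoding="utf8") as f_out:
        for line in f_in:
            eg = json.loads(line)
            text = eg["text"]
            for span in eg.get("spans", []):
                length = span["end"] - span["start"]
                if length <= 0 or length >= len(text):
                    continue
                start = random.randint(0, len(text) - length)
                if start < span["end"] and start + length > span["start"]:
                    continue  # skip offsets that overlap the real entity
                wrong = {"start": start, "end": start + length, "label": span["label"]}
                f_out.write(json.dumps({"text": text, "spans": [wrong], "answer": "reject"}) + "\n")

make_reject_examples("/path/to/your_data.jsonl", "/path/to/reject_examples.jsonl")

You can then add the rejected examples to your dataset with db-in, alongside the accepted ones.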

Maybe you're running out of memory? How much memory does your machine have? Alternatively, it's also possible that spaCy fails on a particular example – there have been a few reports about segmentation faults during training, and they can sometimes be caused by unexpectedly long examples, or by "strange" or unexpected annotated spans. If you can find out whether it consistently fails on the same or similar examples, and which example(s) it is, that would be super helpful! The source of the recipes is shipped with Prodigy, so you can edit the ner.batch-train recipe, add a few print statements that output the current batch / example, and then see if you can spot anything suspicious when the process terminates.

Hi Ines,
Thank you for the immediate response.

On 1:
I will add reject examples and try.

On 2:
My system has 16 GB.
When I tried multiple times, it quit at different percentages of training (38%, 78%, 79%), and one run completed the first iteration and quit in the second. I am using the spaCy PhraseMatcher to find the examples where the max length condition is satisfied. Since this output was created with spaCy, I'm guessing it shouldn't break because an example is "too long" for spaCy.
It was a segmentation fault. (screenshot attached)

Okay, 16 GB should definitely work. But if you can print the examples and find the batch it fails on, maybe that will give us more clues as to what's going on here. (Even if it's just one single example that's causing the crash, the fact that it terminates at different points during training makes sense, because the examples are shuffled.)

We’ve also been fixing a bunch of stuff for spaCy v2.1.0 that will hopefully make training more stable and less sensitive to “unexpected” examples.

Thank you, Ines. I will get back with examples 🙂

Hi, I am trying to continue with this. I am looking for the EntityRecognizer that you are importing in the ner.batch-train recipe, so I can print the example text, but I couldn't find it. Am I looking in the right place? Is that the right place to get the examples out? (I am looking in site-packages.)

I'm not 100% sure I understand your question but I don't think the EntityRecognizer (which is Prodigy's built-in NER model) is what you need. What exactly are you trying to do?

To make working with the code easier, you can also check out the API docs in the PRODIGY_README.html, which include details on the available functions, what they do, etc.

OK, this is what I am trying to do:
I have a list of entities in a dictionary. I labeled the text using the spaCy matcher and output a JSONL file that Prodigy will accept. Using db-in, I loaded that dictionary-labeled data into a Prodigy dataset. Now I want to create a base model from those dictionary annotations, and I am using ner.batch-train for that:
prodigy ner.batch-train my_data en_core_web_sm --output my_data_model --no-missing

This gets killed in the first or second iteration; it is a segmentation fault and Python quits unexpectedly. I don't know how to solve this. (If I can create a base model with this, I hope I will be able to improve it further using ner.teach.)

OK, I converted the same labeled data into the spaCy-compatible format and it ran through over 30 iterations successfully and created a model. Is the training set being too big a problem?
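
For reference, the conversion from the Prodigy-style JSONL into spaCy v2's (text, {"entities": [...]}) training format is roughly this (just a sketch; the file name is a placeholder):

import json

TRAIN_DATA = []
with open("my_data.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        # keep only the character offsets and labels for spaCy
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        TRAIN_DATA.append((eg["text"], {"entities": entities}))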

The segmentation fault might be related to an issue we fixed for the recent Prodigy release, that especially affected large datasets. Could you try again with the new version? Alternatively, training in spaCy should work fine, too.

I am apparently ending up with the same kind of problem with all three tries:

  1. The latest Prodigy version also gives the same segmentation fault I had before.
  2. I converted the data into spaCy-compatible training data. This now gives me the following error and quits: "Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)", which also seems to be a segmentation fault.
    So I am coming to the conclusion that the whole thing might be a problem carried over from spaCy.

My training data is multi-label. I am going to separate out the individual labels, try training on each label individually, and check whether that works.

If you could help me track down the problem, that would be great.

Hi @ines:

Even with a smaller dataset, single label or multiple labels, I am getting this error, in Prodigy as well as in spaCy. I hope I am not making any stupid mistakes. My training data looks like this:

{"id": "3b48a9d3ac9385cc8e3888d63ffe2dd2", "text": "Correlated BHA across interval 3030 - 3092 m . Set plug with mid packer element at 3083 m . Lost 300 lbs of weight after setting sequence was complete. Tagged plug", "spans": [{"dict_id": "2921", "text": "BHA", "start": 11, "end": 13, "token_start": 1, "token_end": 1, "label": "Equipment"}, {"dict_id": "2889", "text": "plug", "start": 51, "end": 54, "token_start": 10, "token_end": 10, "label": "Equipment"}, {"dict_id": "4050", "text": "mid packer element", "start": 61, "end": 78, "token_start": 12, "token_end": 14, "label": "Equipment"}, {"dict_id": "3158", "text": "Lost", "start": 92, "end": 95, "token_start": 19, "token_end": 19, "label": "Well Problem"}, {"dict_id": "3429", "text": "complete", "start": 142, "end": 149, "token_start": 28, "token_end": 28, "label": "Action"}, {"dict_id": "3625", "text": "Tagged", "start": 152, "end": 157, "token_start": 30, "token_end": 30, "label": "Action"}, {"dict_id": "2889", "text": "plug", "start": 159, "end": 162, "token_start": 31, "token_end": 31, "label": "Equipment"}]}
{"id": "2e430b3a22bd3d0cd94daae625e1a7da", "text": "Waited on weather to pull production riser.  Meanwhile: RIH below plug setting depth and confirmed plug free. Flow checked well for 15 min. Well stable.", "spans": [{"dict_id": "3597", "text": "Waited on weather", "start": 0, "end": 16, "token_start": 0, "token_end": 2, "label": "Action"}, {"dict_id": "2967", "text": "pull", "start": 21, "end": 24, "token_start": 4, "token_end": 4, "label": "Action"}, {"dict_id": "3618", "text": "production riser", "start": 26, "end": 41, "token_start": 5, "token_end": 6, "label": "Equipment"}, {"dict_id": "2909", "text": "RIH", "start": 56, "end": 58, "token_start": 11, "token_end": 11, "label": "Action"}, {"dict_id": "2889", "text": "plug", "start": 66, "end": 69, "token_start": 13, "token_end": 13, "label": "Equipment"}, {"dict_id": "2889", "text": "plug", "start": 99, "end": 102, "token_start": 18, "token_end": 18, "label": "Equipment"}, {"dict_id": "2951", "text": "Flow checked", "start": 110, "end": 121, "token_start": 21, "token_end": 22, "label": "Action"}]}
{"id": "ba3fde50c8aa28c16927e1a3dff3891c", "text": "Ran in with test plug and jet sub from surface to 135m. Washed down to 142m at 1700 lpm and 75 rpm.", "spans": [{"dict_id": "2925", "text": "Ran", "start": 0, "end": 2, "token_start": 0, "token_end": 0, "label": "Action"}, {"dict_id": "3074", "text": "test plug", "start": 12, "end": 20, "token_start": 3, "token_end": 4, "label": "Equipment"}, {"dict_id": "3073", "text": "jet sub", "start": 26, "end": 32, "token_start": 6, "token_end": 7, "label": "Equipment"}, {"dict_id": "3478", "text": "Washed down", "start": 56, "end": 66, "token_start": 14, "token_end": 15, "label": "Action"}]}

Even if I only give 1000 lines of these examples to the dataset for batch training, I get the segmentation fault, so it doesn't look like it is happening because of the size of the data. The same thing happens when using spaCy directly, too (single label as well as multi-label).
I am stuck here. Could you help me track this down?

Thanks for your patience on this. It’s hard to resolve this while we don’t have a good idea of where the process is failing. Could you stick to using spaCy for the training for now, set the batch size to 1, and print the examples as you train? This way hopefully we can identify which example is failing.
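
Something along these lines should do it (just a sketch, assuming TRAIN_DATA is a list of (text, {"entities": [(start, end, label)]}) tuples as in the spaCy training examples):

import random
import spacy
from spacy.util import minibatch

TRAIN_DATA = []  # fill with (text, {"entities": [(start, end, label)]}) tuples

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

# begin_training() reinitialises the NER weights, which is fine for debugging
optimizer = nlp.begin_training()
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=1):  # batch size 1
        texts, anns = zip(*batch)
        print(texts[0])  # the last text printed before a crash is the suspect
        nlp.update(texts, anns, drop=0.2, sgd=optimizer, losses=losses)
    print("Iteration", i, losses)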

Hi @honnibal
I tried to print it from the "update" function inside language.py. I made all the training data into one batch. It fails on different sentences at different iterations, and I do not see anything specific in these examples. I have given a few at the end of this post.

There was only one time it went through and saved a model, but it did not tag the equipment in the test sentences. This is training data for one new label, tagged based on the dictionary. I can send the training data if required.

EXAMPLES ON WHICH TRAINING BROKE:

  1. RIH with SBT-CCL-GR. Checked pick up weight at 1000 m / 2500 lbs - 2000 m / 3600 lbs - 2500 m / 5600 lbs. At 3000 m pick up weight ( PUW ) was 8000 lbs, logging weight ( LW ) was 7500 lbs. At 3500 m we showed PUW / LW = 8000 / 7500 lbs. At td ( 4036 m WIRELINE depth ) we showed PUW / LW = 10000 / 9500 lbs.

  2. Electrical assignment problems in SCR room for anchor winches.

  3. Fired guns at first interval from 3273 - 3274,5 m. Good indications of guns beeing fired. WHP 37 bar. Positioned guns for second interval from 3288 - 3294 m. Fired guns at second interval, good indication. WHP 37 bar.

  4. Continued to laying out BHA.

Hmm. Did you run this in Prodigy, or spaCy?

spaCy.