Hi @ines:
Even with a smaller dataset (single label as well as multiple labels), I am getting this error, in Prodigy as well as in spaCy directly. I hope I am not making any stupid mistake. My training data looks like this:
```json
{"id": "3b48a9d3ac9385cc8e3888d63ffe2dd2", "text": "Correlated BHA across interval 3030 - 3092 m . Set plug with mid packer element at 3083 m . Lost 300 lbs of weight after setting sequence was complete. Tagged plug", "spans": [{"dict_id": "2921", "text": "BHA", "start": 11, "end": 13, "token_start": 1, "token_end": 1, "label": "Equipment"}, {"dict_id": "2889", "text": "plug", "start": 51, "end": 54, "token_start": 10, "token_end": 10, "label": "Equipment"}, {"dict_id": "4050", "text": "mid packer element", "start": 61, "end": 78, "token_start": 12, "token_end": 14, "label": "Equipment"}, {"dict_id": "3158", "text": "Lost", "start": 92, "end": 95, "token_start": 19, "token_end": 19, "label": "Well Problem"}, {"dict_id": "3429", "text": "complete", "start": 142, "end": 149, "token_start": 28, "token_end": 28, "label": "Action"}, {"dict_id": "3625", "text": "Tagged", "start": 152, "end": 157, "token_start": 30, "token_end": 30, "label": "Action"}, {"dict_id": "2889", "text": "plug", "start": 159, "end": 162, "token_start": 31, "token_end": 31, "label": "Equipment"}]}
{"id": "2e430b3a22bd3d0cd94daae625e1a7da", "text": "Waited on weather to pull production riser. Meanwhile: RIH below plug setting depth and confirmed plug free. Flow checked well for 15 min. Well stable.", "spans": [{"dict_id": "3597", "text": "Waited on weather", "start": 0, "end": 16, "token_start": 0, "token_end": 2, "label": "Action"}, {"dict_id": "2967", "text": "pull", "start": 21, "end": 24, "token_start": 4, "token_end": 4, "label": "Action"}, {"dict_id": "3618", "text": "production riser", "start": 26, "end": 41, "token_start": 5, "token_end": 6, "label": "Equipment"}, {"dict_id": "2909", "text": "RIH", "start": 56, "end": 58, "token_start": 11, "token_end": 11, "label": "Action"}, {"dict_id": "2889", "text": "plug", "start": 66, "end": 69, "token_start": 13, "token_end": 13, "label": "Equipment"}, {"dict_id": "2889", "text": "plug", "start": 99, "end": 102, "token_start": 18, "token_end": 18, "label": "Equipment"}, {"dict_id": "2951", "text": "Flow checked", "start": 110, "end": 121, "token_start": 21, "token_end": 22, "label": "Action"}]}
{"id": "ba3fde50c8aa28c16927e1a3dff3891c", "text": "Ran in with test plug and jet sub from surface to 135m. Washed down to 142m at 1700 lpm and 75 rpm.", "spans": [{"dict_id": "2925", "text": "Ran", "start": 0, "end": 2, "token_start": 0, "token_end": 0, "label": "Action"}, {"dict_id": "3074", "text": "test plug", "start": 12, "end": 20, "token_start": 3, "token_end": 4, "label": "Equipment"}, {"dict_id": "3073", "text": "jet sub", "start": 26, "end": 32, "token_start": 6, "token_end": 7, "label": "Equipment"}, {"dict_id": "3478", "text": "Washed down", "start": 56, "end": 66, "token_start": 14, "token_end": 15, "label": "Action"}]}
```
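In case it helps with debugging: here is a small sanity-check sketch (plain Python, no Prodigy needed) that slices each span's `start`/`end` offsets out of the text and compares the result against the span's own `text` field. I'm assuming spaCy's convention that `end` is exclusive; the `line` below is just the first span of the first example above, with the text shortened for brevity.

```python
import json

# Sanity-check sketch: verify each span's character offsets against the
# text. spaCy treats "end" as exclusive, i.e. text[start:end] should
# equal the span text exactly.

# First span of the first example above (text shortened for brevity).
line = ('{"text": "Correlated BHA across interval 3030 - 3092 m .",'
        ' "spans": [{"text": "BHA", "start": 11, "end": 13,'
        ' "label": "Equipment"}]}')

def bad_spans(example):
    """Return the spans whose offsets do not slice out their own text."""
    return [s for s in example["spans"]
            if example["text"][s["start"]:s["end"]] != s["text"]]

example = json.loads(line)
for span in bad_spans(example):
    got = example["text"][span["start"]:span["end"]]
    print(f"mismatch: expected {span['text']!r}, sliced {got!r}")
    # prints: mismatch: expected 'BHA', sliced 'BH'
```

If a check like this flags every span as one character short, the offsets in the data would be inclusive rather than exclusive, which seems worth ruling out before digging further.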
Even if I give only 1000 lines of these examples to the dataset for batch train, I still get the segmentation fault, so it doesn't look like it is happening because of the size of the data. The same thing happens when using spaCy directly (single label as well as multi label).
I am stuck here. Could you help me track this down?