NER - SIGSEGV and outputting gold

Hi,

I’ve been experimenting with training a NER system from scratch. I followed the steps in Labeling sequence labeling (e.g. NER) task from scratch to get started, and annotated some examples.

Firstly, the UX for labelling is awesome - way nicer than brat, for instance :slight_smile:. Having labelled these sentences, is there a way to output gold json files? db-out seems to dump the model’s predictions, but not necessarily what labelling has confirmed.

Secondly, when using ner.batch-train I ran into the following error:

Loaded model models/en_ner_test/
Using 20% of examples (771) for evaluation
Using 100% of remaining examples (7324) for training
Dropout: 0.2  Batch size: 128  Iterations: 50


BEFORE     0.135
Correct    315
Incorrect  2019
Entities   897
Unknown    582


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         0.459      277        2057       816        0          0.119
02         0.424      286        2048       865        0          0.123
03         0.378      271        2063       1018       0          0.116
04         0.336      277        2057       1229       0          0.119
05         0.320      279        2055       1157       0          0.120
06         0.289      278        2056       1398       0          0.119
07         0.284      273        2061       1507       0          0.117
08         0.253      266        2068       1440       0          0.114
09         0.221      259        2075       1453       0          0.111
10         0.218      254        2080       1402       0          0.109
11         0.204      246        2088       1435       0          0.105
12         0.191      253        2081       1588       0          0.108
13         0.175      246        2088       1513       0          0.105
14         0.178      251        2083       1544       0          0.108
15         0.162      243        2091       1508       0          0.104
16         0.162      246        2088       1542       0          0.105
17         0.139      242        2092       1567       0          0.104
18         0.159      238        2096       1548       0          0.102
19         0.143      239        2095       1711       0          0.102
20         0.129      229        2105       1444       0          0.098
21         0.128      231        2103       1659       0          0.099
22         0.133      227        2107       1436       0          0.097
 28%|████████████████████████████████████████████████████████████████▌ | 2048/7324 [01:41<04:22, 20.12it/s]
fish: 'python -m prodigy ner.batch-tra…' terminated by signal SIGSEGV (Address boundary error)

Hey,

First: your error is a C-level memory access error. I recently fixed a bug in spaCy around memory allocations: I wasn’t checking that allocations succeeded. Could you check your machine’s memory usage during training, to see whether running out of memory could be the problem? If memory isn’t the issue, the cause is something else.
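One way to run that check is to launch the training command from a small wrapper and read back the peak memory of the child process afterwards. This is only a rough sketch using the Python standard library on a POSIX system; `run_and_report` is a hypothetical helper name, not part of Prodigy or spaCy, and the unit of the reported peak depends on the platform.

```python
import resource
import signal
import subprocess


def run_and_report(cmd):
    """Run cmd and return (returncode, peak_rss, killed_by_sigsegv).

    Hypothetical helper for illustration; POSIX only.
    """
    proc = subprocess.run(cmd)
    # ru_maxrss is the largest peak resident set size among waited-for
    # child processes: kilobytes on Linux, bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # On POSIX, a return code of -N means the child was killed by signal N.
    killed_by_sigsegv = proc.returncode == -signal.SIGSEGV
    return proc.returncode, peak_rss, killed_by_sigsegv
```

Passing the same argument list you normally type in the shell (for example the `python -m prodigy ...` command, split into a list) gives the exit status and peak memory in one go. A peak far below the machine's total memory would suggest the crash is not a failed allocation.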

As for outputting gold annotations: I thought about this a lot! The source code for prodigy/recipes/ner.py is included, so you can have a look if you’d like to see the different methods available. There should also be documentation for them.

The main recipe you want is ner.make-gold. This lets you make several passes over the same text, making annotations that are used to constrain the choice of parses. This method is used because when you reject an entity, we still don’t know what the correct one is – there are lots of other possibilities.
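To see why a rejection alone doesn't give you gold data, it can help to split an exported dataset by decision. The sketch below assumes a JSONL export (such as the output of db-out) in which each record carries an "answer" field with a value like "accept", "reject" or "ignore". `split_by_answer` is a hypothetical helper, not a Prodigy function.

```python
import json


def split_by_answer(jsonl_lines):
    """Group exported annotation records by their "answer" value.

    Assumes one JSON object per line with an "answer" key; records
    without one are collected under "missing".
    """
    groups = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        groups.setdefault(record.get("answer", "missing"), []).append(record)
    return groups
```

The "accept" group tells you which suggested entities were right. The "reject" group only rules out one analysis per example, which is the gap that ner.make-gold is meant to close.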

In order to use ner.make-gold, you have to populate a dataset with text first, using db-in. A sometimes useful pattern is to select the texts you want to annotate by training a text classification model.
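For the db-in step, one simple option is to write the raw texts to a JSONL file with one {"text": ...} record per line and import that file. A minimal sketch, assuming plain-text inputs; `write_jsonl` and the file name are made up for illustration.

```python
import json


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON record per line and return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The resulting file can then be imported with something along the lines of `prodigy db-in your_dataset texts.jsonl`, where both names are placeholders.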

It’s useful to use ner.print-dataset during this process, to check on the results you’re generating.

I’ve just re-run the training a few times, all of which resulted in the same SIGSEGV, with plenty of unused system memory.

I’ll have a look at ner.make-gold and see how I get on :slight_smile:.

Hm! Guess it’s something else then. We’ll look into it. It’s probably a bug in spaCy’s beam training — it’s a bit more complicated, and much less tested, than the normal training procedure.

Found an error in spaCy’s beam parsing code that looks very relevant. I think this was causing a minor memory leak, too.

It could be that I’ve fixed a different memory error, but I’ll mark this fixed, because I do think the error I’ve fixed is the relevant one.