I have a question about how best to incorporate the original training data to avoid catastrophic forgetting. I’ve been tweaking spaCy NER’s GPE tag to better pick up multi-word and hyphenated place names (which come up a lot in Spanish and Arabic names) and to handle some kinds of short text. After a thousand examples or so, quality on those cases improves, but it really falls apart on other place names. I’ve licensed the OntoNotes corpus and would like to use its annotations to remind spaCy what other GPEs look like. I can think of two ways to do this:
1. Convert OntoNotes sentences into Prodigy’s format and load some of them into the annotations DB with `db-in`.
2. Export Prodigy’s annotations to the spaCy training format, intermingle them with OntoNotes in spaCy format, and train using spaCy.
Do you have advice on which one makes more sense? Can spaCy’s training work on incomplete (non-gold) sentence annotations like the ones Prodigy produces? Is there a Prodigy format --> spaCy format converter?
I think the question applies for pseudo-rehearsal as well, since you’d need some way of intermingling the data.
The `spacy train` command is probably nicer to work with for larger training tasks, and if I hook up some fancy distributed training in the future, that’s where it will land first. So if you have the rest of the OntoNotes data, I think converting into spaCy’s format might be better. Then again, spaCy’s format is kind of annoying.
It would be nice if just parsing a lot of text and using that as part of the training were sufficient. I think one thing we’re lacking to make that happen at the moment is “soft training”: if the parser outputs exactly the same distribution as the original model, there should be zero gradient. At the moment that’s not the case, because we take the discrete labels produced by the original model as the “gold standard”, so we’re not actually training towards the same model that produced those annotations. I think our pseudo-rehearsal will work better if we fix this and allow soft training.
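To make the difference concrete, here’s a toy sketch in plain numpy (nothing spaCy-specific, and the scores are made up). The gradient of a softmax cross-entropy loss with respect to the scores is just the predicted distribution minus the target distribution, so with soft targets the update vanishes as soon as the new model matches the original, while with discrete pseudo-gold labels it doesn’t:

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Made-up label scores for one token, from the original ("teacher") model
# and from the model being updated ("student"), which currently agrees exactly.
teacher_scores = np.array([2.0, 0.5, -1.0])
student_scores = teacher_scores.copy()

p_teacher = softmax(teacher_scores)
p_student = softmax(student_scores)

# Soft training: the gradient w.r.t. the student's scores is p_student - p_teacher,
# which is exactly zero when the two distributions match.
soft_grad = p_student - p_teacher          # -> [0. 0. 0.]

# Hard pseudo-gold training: take the teacher's argmax as a one-hot target.
# The gradient is non-zero even though the student already reproduces the
# teacher's predictions, so the model keeps drifting.
hard_target = np.eye(len(p_teacher))[p_teacher.argmax()]
hard_grad = p_student - hard_target        # -> non-zero
```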
(Actually, now that I’m thinking aloud on this… it occurs to me that a very simple and efficient solution would be to record all the activations across the original network. These activations could then be used to train all the internal layers directly, in parallel.)
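Something along these lines, assuming for the sketch that each layer can be called as a plain function and its output recorded (real spaCy/thinc layers would need hooks for this, which I’m glossing over):

```python
import numpy as np

def record_activations(layers, X):
    """Run the original model once and keep every intermediate output."""
    activations = []
    for layer in layers:
        X = layer(X)
        activations.append(X)
    return activations

def layerwise_losses(new_layers, X, recorded):
    """Train each layer of the new model towards the recorded activations:
    layer i takes recorded output i-1 as input and recorded output i as its
    regression target, so all layers can be fitted independently."""
    inputs = [X] + recorded[:-1]
    losses = []
    for layer, inp, target in zip(new_layers, inputs, recorded):
        out = layer(inp)
        losses.append(np.mean((out - target) ** 2))   # simple L2 "hint" loss
    return losses
```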
This is actually a nice idea – maybe Prodigy should expose a helper function for this. The conversion is pretty straightforward, because the annotation tasks already include everything you need. You can also access the Prodigy database from Python, so you won't even have to export your dataset to a file. The (untested) code below should give you spaCy training data in the "simple training style", i.e. with annotations provided as a dictionary:
```python
from prodigy.components.db import connect

def dataset_to_train_data(dataset):
    """Convert a Prodigy NER dataset to spaCy's 'simple training style'."""
    db = connect()                       # connect to the DB configured in prodigy.json
    examples = db.get_dataset(dataset)   # load all annotations in the dataset
    train_data = []
    for eg in examples:
        if eg['answer'] == 'accept':     # only use examples you accepted
            entities = [(span['start'], span['end'], span['label'])
                        for span in eg['spans']]
            train_data.append((eg['text'], {'entities': entities}))
    return train_data
```
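To actually update a model with that output, the usual simple-training loop looks roughly like this. The dataset name, model and hyperparameters below are just placeholders, and how you create the optimizer depends on your spaCy version and on whether you’re updating an existing model, so treat it as a sketch:

```python
import random
import spacy

nlp = spacy.load('en_core_web_sm')                 # or the model you've been updating
TRAIN_DATA = dataset_to_train_data('gpe_dataset')  # placeholder dataset name

# Check the spaCy docs for the recommended way to get an optimizer when
# updating an existing pre-trained model rather than training from scratch.
optimizer = nlp.begin_training()

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):              # only update the NER weights
    for itn in range(10):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35)
```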
The dataset_to_train_data function will just export each entry in the dataset as a separate training example. If you want to reconcile spans on the same input that were separated into individual tasks, you can use the input hash:
```python
from prodigy.util import INPUT_HASH_ATTR  # == '_input_hash', but using the constant is nicer
```
You can then check eg[INPUT_HASH_ATTR] and merge the spans of annotated examples that refer to the same input.
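Untested as well, but the merging could look something like this, just unioning the span tuples for each input hash:

```python
from collections import defaultdict

from prodigy.components.db import connect
from prodigy.util import INPUT_HASH_ATTR

def dataset_to_train_data_merged(dataset):
    """Like dataset_to_train_data, but merges the spans of all accepted
    answers that share the same input hash, i.e. the same original text."""
    db = connect()
    texts = {}
    merged = defaultdict(set)
    for eg in db.get_dataset(dataset):
        if eg['answer'] != 'accept':
            continue
        input_hash = eg[INPUT_HASH_ATTR]
        texts[input_hash] = eg['text']
        for span in eg.get('spans', []):
            merged[input_hash].add((span['start'], span['end'], span['label']))
    return [(texts[h], {'entities': sorted(spans)})
            for h, spans in merged.items()]
```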
I just posted some code to mix OntoNotes NER annotations with new Prodigy NER annotations and train on the combination: rehearsal.py. OntoNotes is free for research use, and I didn’t have any trouble getting a copy by emailing from a university address.
rehearsal.py generates a new Prodigy dataset containing both the NER annotations from a given dataset and a number of OntoNotes examples per annotation. For example, the following will augment the annotations in the loc_ner_db dataset with five OntoNotes examples per annotation:
```
python rehearsal.py "loc_ner_db" 5
```
The augmented data is written to a dataset called augmented_for_training, which should be treated as temporary because the script overwrites it each time. NER training can then be performed as usual:
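For example, with Prodigy’s batch-train recipe (the exact arguments here are from memory, so double-check them against `prodigy ner.batch-train --help` for your version):

```
prodigy ner.batch-train augmented_for_training en_core_web_lg --output /tmp/ner-model --n-iter 10
```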
I’m glad LDC are making it easier to get the corpus. When I was a researcher it was still quite annoying: even though the corpus doesn’t cost money for researchers, there’s still the process of making sure the institution is an LDC member. For commercial usage, the LDC license is fairly expensive (around $25k).
Just a quick test, but it seems to work well. The OntoNotes-trained models are pretty good at place names (LOC, GPE) in OntoNotes:
Correct 142
Incorrect 12
Accuracy 0.922
…but it’s really bad at place names in Wikipedia articles. I collected around 600 annotations, which the off-the-shelf model does very poorly on:
Correct 1
Incorrect 96
Accuracy 0.010
Just training on the 600 Wikipedia annotations was enough to make the model forget how to handle news text: without any rehearsal, evaluating on OntoNotes after training only on the newly collected Wikipedia annotations showed a big drop in accuracy.
Using more than 5 old examples per new example would help it remember, but possibly at the cost of learning the new types more slowly.
Re LDC, I found them very helpful and easy to work with. I’m not in the department that holds the LDC membership at my university, but they were willing to set up a new account for me and give me the free corpora I asked for. I’m sure it’s a different story for commercial users, though.
In case others want it, here is the script I used to compare the accuracy of two NER models on two datasets.
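The core of it is just counting, for each model and dataset, how many annotated entity spans the model reproduces exactly. A rough sketch of that idea (not the exact script), using the `(text, {'entities': ...})` pairs from earlier in the thread:

```python
import spacy

def score_model(model_path, examples, labels=('GPE', 'LOC')):
    """Count annotated spans the model predicts exactly (same offsets and label).
    `examples` are (text, {'entities': [...]}) pairs, e.g. from dataset_to_train_data()."""
    nlp = spacy.load(model_path)
    correct = incorrect = 0
    for text, annotations in examples:
        predicted = {(ent.start_char, ent.end_char, ent.label_)
                     for ent in nlp(text).ents}
        for start, end, label in annotations['entities']:
            if label not in labels:
                continue
            if (start, end, label) in predicted:
                correct += 1
            else:
                incorrect += 1
    accuracy = correct / max(correct + incorrect, 1)
    return correct, incorrect, accuracy
```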