I tried using a blank model, but for some reason it gave me an accuracy of 0.00 when running ner.batch-train after a little more than 1000 annotations.
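For context, the training command was roughly the following (msn_annotations and the flag values are placeholders rather than my exact setup, and ./blankv1 is the blank model produced by the script below):

prodigy ner.batch-train msn_annotations ./blankv1 --label MSN --output ./msn_model --n-iter 10 --eval-split 0.2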
When creating the blank model, I hit the error described here. I checked my spaCy version, which is 2.0.12, so I worked around that and ended up with this:
from __future__ import unicode_literals, print_function
from pathlib import Path
import shutil
import spacy


def main(output_dir=None):
    nlp = spacy.blank('en')  # create blank Language class
    print("Created blank 'en' model")
    if 'ner' not in nlp.pipe_names:
        print("Adding ner pipe")
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # naming the vectors works around the "unnamed vectors" error in spaCy 2.0.x
    nlp.vocab.vectors.name = 'en_core_web_lg.vectors'
    optimizer = nlp.begin_training()
    losses = {}
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # dummy update on an empty batch, just to initialise the NER weights
        nlp.update(
            [],  # batch of texts
            [],  # batch of annotations
            drop=0.5,  # dropout - make it harder to memorise data
            sgd=optimizer,  # callable to update weights
            losses=losses)
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if output_dir.exists():
            shutil.rmtree(output_dir)  # clear out any previous model
        output_dir.mkdir()
        nlp.meta['name'] = 'blank_ner_model'  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)


if __name__ == '__main__':
    main('./blankv1')
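To sanity-check the result before pointing Prodigy at it, a quick load of the saved model (same path as above) should show the ner pipe and the new name:

import spacy

nlp = spacy.load('./blankv1')
print(nlp.pipe_names)    # should print ['ner']
print(nlp.meta['name'])  # should print 'blank_ner_model'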
I tried en_core_web_sm, and while it is faster to work with, accuracy and annotation quality suffer a little (as far as I can tell).
I generated match patterns for Manufacturer Serial Number (MSN) to feed into Prodigy, and after doing ~600 annotations and training with en_core_web_lg I am up to 92.6% accuracy for the MSN model.
This will probably be my next attempt at getting it above 95% accuracy.
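For anyone curious about the patterns: they are plain Prodigy-style JSONL, one token pattern per line. The attribute values below are invented for illustration, not my real MSN shapes:

{"label": "MSN", "pattern": [{"lower": "msn"}, {"like_num": true}]}
{"label": "MSN", "pattern": [{"shape": "XXXX-dddd"}]}

They get fed into ner.teach via the --patterns flag, e.g. prodigy ner.teach msn_annotations en_core_web_lg source.jsonl --label MSN --patterns msn_patterns.jsonl.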