New language model for NER

Hi,

I'm trying to train a new NER model for Danish that recognizes the entity type PROFESSION. I initialized the model with the spaCy CLI, using fastText word vectors.

What is the recommended way to move forward?

I've made around 1k annotations with terms.teach. Correct professions get a score of around 0.46 and incorrect ones around 0.39. I'm not sure how to use this pretraining together with NER.

Any recommendations?

Just to clarify: In this step, you didn't actually train or modify the vectors in the model – you just used them to create a terminology list for the new entity type that you can use to help label the data (and of course get a better feeling for the vectors and how useful they are).
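To illustrate why those scores don't change the model: a score like 0.46 is only a similarity between a candidate word's vector and the terms you've accepted so far. Here's a toy sketch of that idea, assuming cosine similarity against the average seed vector. The three-dimensional vectors and the `rank_candidates` helper are made up for the example, and the scoring Prodigy actually uses may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def rank_candidates(vectors, seeds, candidates):
    """Score candidates by similarity to the averaged seed vector.

    The vectors are only read here, never updated.
    """
    target = mean_vector([vectors[s] for s in seeds])
    scored = [(cosine(vectors[c], target), c) for c in candidates]
    return sorted(scored, reverse=True)


# Hypothetical toy vectors (real fastText vectors have hundreds of dimensions)
VECTORS = {
    "læge": [0.9, 0.1, 0.0],
    "sygeplejerske": [0.8, 0.2, 0.1],
    "tømrer": [0.7, 0.3, 0.0],
    "bil": [0.0, 0.1, 0.9],
}

ranking = rank_candidates(VECTORS, ["læge", "sygeplejerske"], ["tømrer", "bil"])
for score, term in ranking:
    print(f"{term}\t{score:.2f}")
```

The accepted terms are what you keep from this step; the vectors themselves stay exactly as they were.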

If you're starting from scratch for a completely new language, I do think it'd probably make sense to label manually – at least in the beginning. So you'd use a recipe like ner.manual to start labelling – or ner.make-gold if your model already assigns entities to the doc.ents and you want to keep them in the data or correct them if necessary.

To make this easier, you can also use the terms you created previously to pre-highlight spans. Even if your terms only cover half of the entities in the data, that's still 50% less work for you. A nice trick you can use is spaCy's new EntityRuler (see here for details). It takes the same pattern files as Prodigy, so you can use terms.to-patterns to create the patterns, add them to the entity ruler, add the entity ruler to your model and save it out. Your temporary model with entity rules will now set the doc.ents – and Prodigy's ner.make-gold recipe will pre-highlight every entity that's already present in the doc.ents.

Basically, something like this:

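import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("your_model")
# Load the patterns created with terms.to-patterns into an entity ruler
ruler = EntityRuler(nlp).from_disk("./patterns.jsonl")
nlp.add_pipe(ruler)
# Save the model with the entity ruler in its pipeline
nlp.to_disk("/path/to/your_model_with_rules")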

your_model_with_rules is basically an intermediate dummy model that helps you label more efficiently.

prodigy ner.make-gold ner_dataset /path/to/your_model_with_rules your_data.jsonl --label PROFESSION
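For reference, the patterns.jsonl file loaded by the entity ruler above is plain JSONL with one pattern per line. Here's a minimal sketch of building such a file by hand from a term list, assuming each term is matched on its lowercase tokens (which is, as far as I know, roughly what terms.to-patterns writes). The `terms_to_patterns` helper and the example terms are hypothetical, and splitting on whitespace is a simplification of real tokenization.

```python
import json


def terms_to_patterns(terms, label):
    """Build one token-based pattern per term, matching on lowercase text."""
    patterns = []
    for term in terms:
        # Naive whitespace split; an assumption standing in for real tokenization
        tokens = [{"lower": tok} for tok in term.lower().split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


def to_jsonl(patterns):
    """Serialize patterns as one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in patterns)


patterns = terms_to_patterns(["Læge", "tømrer"], "PROFESSION")
jsonl = to_jsonl(patterns)
print(jsonl)
```

Writing the resulting string to ./patterns.jsonl gives you a file in the shape the entity ruler reads.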

Once you've labelled a bunch of examples, you can run a first ner.batch-train experiment to see how the model is learning from the data. If your data contains all entities you want to train (i.e. is "gold standard"), make sure to set the --no-missing flag so spaCy can take advantage of knowing that all unannotated tokens are outside an entity (and not missing values).
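To make the difference concrete, here's a simplified sketch that converts span annotations into per-token BILUO tags. Unannotated tokens are filled either with "O" (outside an entity, which is what --no-missing implies) or with "-" (the marker spaCy uses for a missing value). The `spans_to_tags` helper and the example sentence are hypothetical, and this is an illustration of the idea, not spaCy's actual implementation.

```python
def spans_to_tags(n_tokens, spans, no_missing):
    """Convert (start, end, label) token spans to BILUO tags.

    `end` is exclusive. Unannotated tokens become "O" if the data is
    treated as gold standard, otherwise "-" (unknown / missing).
    """
    fill = "O" if no_missing else "-"
    tags = [fill] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
        else:
            tags[start] = f"B-{label}"  # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"  # inner tokens
            tags[end - 1] = f"L-{label}"  # last token
    return tags


tokens = ["Hun", "arbejder", "som", "læge", "i", "Aarhus"]
spans = [(3, 4, "PROFESSION")]

gold_tags = spans_to_tags(len(tokens), spans, no_missing=True)
partial_tags = spans_to_tags(len(tokens), spans, no_missing=False)
print(gold_tags)
print(partial_tags)
```

With "O" tags the model gets an explicit negative signal for every other token, while "-" tells it nothing about them, which is why the flag only makes sense if your annotations are complete.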

That's perfect! Thank you
