I’m trying to create a new model but am having trouble saving it to disk. I get the error AttributeError: 'bool' object has no attribute 'to_bytes' after running this code:
import spacy

model = spacy.blank('en')
model.add_pipe(model.create_pipe('tagger'))
model.add_pipe(model.create_pipe('parser'))
model.add_pipe(model.create_pipe('ner'))
model.add_pipe(model.create_pipe('textcat'))
model.to_disk('test')
I think the NER part is the problem – it saves otherwise. I’ve also tried loading the en_core_web_lg model and replacing the ner object with my own, but I have the same saving issue.
Also, one other question: if I want certain words to be marked as stop words (beyond the default ones), how do I do that? If I make a Matcher for these words, I see I can add a callback to mark them as stop words, but can I include that matcher in the pipeline?
Ah, spaCy should probably fail more gracefully here. I think the problem is that when you create a new pipeline component, its model and weights are not initialised yet. The component's model attribute then defaults to True, which causes the error during serialization. See the __init__ method of the EntityRecognizer here:
model (thinc.neural.Model or True): The model powering the pipeline component. If no model is supplied, the model is created when you call begin_training, from_disk or from_bytes.
If you call nlp.begin_training() before saving the model out to disk, the weights will be initialised and you should be able to save it without problems.
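For example, here's a minimal sketch of the fix for the pipeline in your snippet (same component names and output path):

import spacy

nlp = spacy.blank('en')
for name in ('tagger', 'parser', 'ner', 'textcat'):
    nlp.add_pipe(nlp.create_pipe(name))

# begin_training() initialises the weights, so each component's model
# attribute becomes a real model instead of the placeholder True
nlp.begin_training()
nlp.to_disk('test')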
Btw, if you're training a model with Prodigy, you don't actually have to add the empty pipeline components upfront. Prodigy will do this for you during training. So you could also simply save out spacy.blank('en'). (If you're training a text classifier, you even have the option to start off with a blank model by leaving out the spacy_model argument and setting --lang en instead.)
The most straightforward way would be to modify the is_stop attribute on the lexeme in the vocab. When you save out the model, the vocab will be saved with it, which will include your custom stop words. For example:
for word in ('apple', 'pear', 'banana'):
nlp.vocab[word].is_stop = True
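After that, the flag should apply to new documents as well – a quick sanity check, with a made-up example sentence:

doc = nlp("I'd like an apple")
assert doc[4].is_stop  # 'apple' is now flagged as a stop word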
If you want to implement other custom logic using the Matcher – for example, setting custom attributes on tokens – this is possible, too. You'd only have to include your logic with the model. Because spaCy models are Python packages, you can ship any code with them or make them depend on other packages. So you could wrap your logic in a custom pipeline component and add the code to your model's __init__.py. Your model's load() method just needs to return an nlp object with the correct pipeline and data loaded in, plus any other modifications you want to make during loading. See this section in the docs and spaCy's helper function load_model_from_init_py for more details and implementation examples.
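To make that concrete, the package's __init__.py could look roughly like this – just a sketch, where FruitMatcher and the is_fruit attribute are hypothetical names for illustration:

from spacy.matcher import Matcher
from spacy.tokens import Token
from spacy.util import load_model_from_init_py

# hypothetical custom attribute, set by the component below
Token.set_extension('is_fruit', default=False)

class FruitMatcher(object):
    name = 'fruit_matcher'

    def __init__(self, nlp):
        self.matcher = Matcher(nlp.vocab)
        # one pattern per word we want to match
        self.matcher.add('FRUIT', None, [{'LOWER': 'apple'}],
                         [{'LOWER': 'pear'}], [{'LOWER': 'banana'}])

    def __call__(self, doc):
        for match_id, start, end in self.matcher(doc):
            for token in doc[start:end]:
                token._.is_fruit = True
        return doc

def load(**overrides):
    nlp = load_model_from_init_py(__file__, **overrides)
    nlp.add_pipe(FruitMatcher(nlp), last=True)
    return nlp

Note that the component itself isn't serialised with the model data, which is why load() has to re-add it every time the package is loaded.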
I'm working on training a new NER model. When I simply save a blank model and try to work with the ner.teach recipe I get the following error:
KeyError: "No component 'ner' found in pipeline. Available names: ['sbd']"
I can run ner.batch-train with the blank model without any issues, though. Is it a problem to use different models for the teach and batch-train commands? For now I'm using en_core_web_lg for the teach portion (since it works!).
Ah, okay – yes, the ner.teach recipe already expects the model to have an ner component, so a blank model saved without one won't work out of the box. If you want to train new categories from scratch and not use any of the built-in labels, your approach is good.
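That said, if you'd rather start ner.teach from a blank model, you could make sure it actually has an ner component before saving it out – roughly like this (untested sketch, with a made-up output path):

import spacy

nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('ner'))
nlp.begin_training()  # initialise the weights (see above)
nlp.to_disk('blank_ner_model')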
This depends – if the model has pre-trained vectors (like the en_core_web_md or lg models, or a custom one), those will be used as features. This means that the model you use during active learning annotation should have the same vectors available as the model you train later on. At least, this will usually give you better accuracy.
Another thing to consider is that during ner.teach, Prodigy will suggest entities that are consistent with the model's other predictions and constraints – which includes other entity types. For example, let's say you're annotating a new category ANIMAL and you've already labelled enough examples from the patterns for the model to start suggesting things. If you come across a text with an entity like "Fox" and the model already has a very confident ORG prediction for it, it might not ask you about the ANIMAL score. So the training examples you collect will be great for updating the existing model with the existing entity types – but not necessarily the best choice for training a blank model with no other entity types.
Long story short – now that I think about it, your approach definitely makes sense. However, it still depends on your data.
This could be a nice experiment, actually – you could collect a few hundred annotations with both approaches (ner.teach with an empty model and pre-trained model), then train a model from scratch using an empty model and see how the accuracy compares for both training sets.
Good point! I did actually give this a shot, but whenever I train a model with one of the en_core_web models I run into "catastrophic forgetting" (I think that's the right term?) and it labels everything as a WORK_OF_ART, whereas if I use the blank model there are a lot of false positives. Guess there's more work to do! Thanks for the answer!
A follow-up question: if I start training ner with a blank model, does that mean it doesn't take any dependency or POS information into account? Would I need to add the tagger and parser to the pipeline before it?
I’ve tried doing this so far (with and without the tagger/parser), and the ner model seems to have a hard time finding any matches from my patterns file (basic words). Even when words that match my patterns appear in the document, it rarely highlights them during entity training.