Should I be using --base-model when training my model?

Using patterns and then a new model I've been training (without --base-model specified), I've developed some gold label data for 3,000 documents and a single NER category.

The end goal is to use this model on discussions indexed from the web, first with English documents.

Should I be training this model using my gold label data and specifying one of spacy's en_core_web models as a --base-model?

I've tried it and I think the model is improving, but I'm seeing something strange, so I'm wondering if this is the wrong approach. The strangeness is that when I run ner.correct using the newly base-model-trained pipeline on the rest of my dataset, it warns:

The model you're using isn't setting sentence boundaries (e.g. via the parser or sentencizer). This means that incoming examples won't be split into sentences.

And when I look at the data in Prodigy, sure enough it has started over at the beginning of my dataset, and everything is just the first sentence of each document.

I have to specify --unsegmented if I want to see the full paragraphs again.

Conversely, when I run ner.correct on a model I trained without any --base-model set, the documents are rendered completely and I don't get the warning.

And, if I should be using a --base-model, how does one decide which one to use? I see that en_core_web_trf is more accurate than en_core_web_sm, but what does that typically mean in practice? Is it just training the model that's slower and memory intensive, or will using the trained model also be slower and require more resources?

Could you share the exact commands that you tried running? That might help me reproduce the error that you're seeing. Also, just to confirm, are you running recent versions of Prodigy/spaCy/Python?

In general, you can switch the base model, which is the starting point for the training procedure, and this could result in better/worse-performing models. In particular, the en_core_web_trf typically yields good accuracy, but it does come at the cost of speed. This pipeline contains a transformer model which is typically slower to train and much slower to run in production, typically on the order of 10x. You can read more about the speed details here.

@koaning Sure, first I started with a source containing a bunch of blobs of text in jsonl format (paragraphs of text, but just spaces for separators, no line breaks or anything), and patterns also in a jsonl format:

prodigy ner.manual mymodel en_core_web_sm ./source.jsonl --label MYCATEGORY --patterns patterns.jsonl

I then labeled probably 500 source documents this way before training mymodel like this:

prodigy train ./modeltrain1 --ner mymodel

It would then ultimately result in a statement like:

✔ Saved pipeline to output directory
modeltrain1/model-last

Then I would further label using something like:

prodigy ner.correct mymodel modeltrain1/model-last ./source.jsonl --label MYCATEGORY

And would just keep bouncing back and forth between train and ner.correct, incrementing the training directory each time, e.g. modeltrain2, modeltrain3, etc, in case I wanted to return to a previously trained model (I never did, ultimately).

As I got more and more labels, I experimented with using --base-model:

prodigy train ./modeltrain9 --ner mymodel --base-model en_core_web_sm

It seemed to score higher, so I decided to keep going and use it to label some more with my usual follow-up ner.correct:

prodigy ner.correct mymodel modeltrain9/model-last ./source.jsonl --label MYCATEGORY

And that's when I would get the warning:

:warning: The model you're using isn't setting sentence boundaries (e.g. via the parser or sentencizer). This means that incoming examples won't be split into sentences.

When I'd load http://localhost:8080 to label, I noticed it started over on my source documents, back to the first one, and only the first sentence of that document was shown. It was like it saw the data only as the first sentence of each document now.

I was then able to work around this by adding the --unsegmented param:

prodigy ner.correct mymodel modeltrain9/model-last ./source.jsonl --label MYCATEGORY --unsegmented

However, I began wondering if I was using --base-model when I shouldn't be, and I don't really understand what's happening under the covers when training with or without the base model.

Hi @tomw!

I think the confusion comes from the fact that each spaCy model can have a different set of pipeline components.

Your first model, modeltrain1, didn't use en_core_web_sm, so by default the base model was a blank model. The output model had only two components: tok2vec and a new ner, trained with only your MYCATEGORY label. Notice it doesn't have a parser or senter (sentence segmentation) component.

However, you did use en_core_web_sm for modeltrain9, which kept the original ner model in en_core_web_sm and added your new category (MYCATEGORY) to its existing ner labels. I suspect the model accuracy could have been masked by the pre-existing labels (e.g., ORG) in the en_core_web_sm ner model.

The en_core_web_sm pipeline also has a parser component, and that is what causes the difference you're seeing in the sentence-boundaries warning between the two models.
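To build some intuition for why the stream behaves differently, here's a simplified, pure-Python stand-in for how a recipe can segment incoming examples into sentence-level tasks. This is only an illustration, not Prodigy's actual implementation, and split_into_sentences is a naive stand-in for a parser/sentencizer:

```python
# Simplified stand-in for a recipe stream that can split incoming
# examples into sentences (NOT Prodigy's actual implementation).

def split_into_sentences(text):
    # Naive splitter standing in for a parser/sentencizer component.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def stream(docs, segment=True):
    # With segmentation on, each sentence becomes its own task, so the
    # first tasks you see are single sentences rather than full documents.
    for doc in docs:
        if segment:
            for sent in split_into_sentences(doc):
                yield {"text": sent}
        else:
            yield {"text": doc}  # roughly what --unsegmented preserves

docs = ["First sentence. Second sentence.", "Another doc. More text."]
segmented = list(stream(docs, segment=True))
unsegmented = list(stream(docs, segment=False))
print(len(segmented), len(unsegmented))  # 4 2
```

Sentence-level tasks also don't match the document-level tasks you annotated earlier, which is why the feed can appear to "start over" at the top of your source.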

I would suggest running print(nlp.pipe_names) for each of your models so you can see all of the components in each pipeline. You can also manually disable any components you don't want.

Here is an excellent recent spaCy short on disabling spaCy components by @koaning:

Let me know if this makes sense and helps!


@ryanwesslen Thank you, this does clear up a lot of confusion.

However, I'm still wondering what's better when training an NER model: should I start from scratch and train only on my gold annotations, or should I use a base model like en_core_web_sm?

Also, I see how to disable pipeline components in my Python scripts when using nlp, but I don't see how to disable components (or whether I even should) when running train or ner.correct.

Hi @tomw!

A base model will likely help by providing word vectors, but you'll want to turn off its ner component, since your own ner model will be trained from scratch. If you don't turn off the ner, you'll be adding your new entity type to the existing ner labels.

This post shows how to do it (FYI, some syntax has changed, but the general idea is the same):

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.remove_pipe("ner")  # you can remove other unused components too
nlp.to_disk("en_core_web_sm_without_ner")

This also solves your second question: after to_disk(), you can pass this saved pipeline to train or ner.correct like any other model:

prodigy ner.correct gold_ner en_core_web_sm_without_ner ./news_headlines.jsonl --label PERSON,ORG
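And assuming you then want to train with the stripped pipeline as your base model (the output directory name here is just an example), the follow-up command would look like:

```shell
# Use the saved pipeline (ner removed) as the base model, so the
# word vectors/tok2vec carry over while ner is trained from scratch.
prodigy train ./modeltrain_sm_base --ner mymodel --base-model en_core_web_sm_without_ner
```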

As @koaning mentioned, you do have a choice of base model (word vectors), where speed vs. accuracy trade-offs come into play.

If you use en_core_web_sm, it will be the fastest and most compact, but it has the lowest performance. Alternatively, en_core_web_trf will give you the greatest accuracy, but be cautious: putting transformers into production can be challenging (e.g., you'll likely need a GPU).

One compromise could be to use the en_core_web_lg model like in the post above. Like the small model it's fast, but it performs better thanks to a larger set of word vectors, which also make it larger on disk (382 MB). You can also look at our experiments with different NER models and mimic them to run your own experiments to determine which base model performs best on your data.

Let me know if this helps or if you have any other questions!


Thank you for your help @ryanwesslen! I tried this approach, removing the ner from both the _sm and _lg models, and then writing to disk and using that new model as my base-model when training. However, when I do this I get this error:

✘ Config validation error
Bad value substitution: option 'width' in section 'components.ner.model.tok2vec' contains an interpolation key 'components.tok2vec.model.encode.width' which is not a valid option name. Raw value: '${components.tok2vec.model.encode.width}'

I've been getting ready to move off training on my personal computer and onto an AWS VM of some sort, so I wanted to just experiment with training _trf locally before I started the process. (My understanding is that _trf will work without GPU, it's just much slower?)

However, when I use that en_core_web_trf as my base model for train, I get this error:

Token indices sequence length is longer than the specified maximum sequence length for this model (670 > 512). Running this sequence through the model will result in indexing errors

I don't experience this error when using _sm or _lg. Do I just need to run this on a machine with a GPU, or is something else going on here?

Thanks!

hi @tomw!

We're making progress, but there's still some work left :slight_smile:

Let me think about your first question a little more. I just tried to remove ner and had no issues. Can you run these commands to double-check your spaCy/Prodigy versions and confirm what's in your pipeline? Worst case, try removing everything except your tok2vec component and try again.

>>> import spacy
>>> spacy.__version__
'3.2.4'
>>> import prodigy
>>> prodigy.__version__
'1.11.7'
>>> nlp = spacy.load("en_core_web_sm_without_ner")
>>> print(nlp.pipe_names)
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

There was a similar issue with textcat, but there the recommendation was to remove tok2vec, which likely won't help you: if you remove tok2vec along with the other components, you're right back to a blank:en model. The tok2vec is exactly what we want to keep.
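If you want to try that worst-case fallback, here's a minimal sketch of keeping only tok2vec. It uses a blank pipeline with stand-in components so it runs without downloading anything; with your real model you'd use spacy.load("en_core_web_sm") instead of the blank setup:

```python
import spacy

# Stand-in pipeline; with a real model you'd instead do:
#   nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
nlp.add_pipe("tok2vec")
nlp.add_pipe("tagger")
nlp.add_pipe("sentencizer")

# Remove everything except the tok2vec component.
for name in list(nlp.pipe_names):
    if name != "tok2vec":
        nlp.remove_pipe(name)

print(nlp.pipe_names)  # ['tok2vec']
# nlp.to_disk("en_tok2vec_only")  # save to pass as a --base-model
```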

For your 2nd question, there was a similar GitHub issue on it and a related GitHub discussion. This is a warning, not an error: it's truncating any of your documents that are over 512 tokens long, and it looks like you had a record that was 670 tokens. This is a known limitation of most transformers and isn't specific to running on a CPU.

You are correct that you don't absolutely need a GPU (see this spaCy issue discussion), but you'll likely run into issues if you try to put a transformer into production without one. So if you're treating this model as a test case, you can try a CPU for now. Perhaps run a few experiments varying compute time vs. number of records; I'd be curious where the bottleneck occurs!