prodigy ner blank vs vectors model

Hello @ines, I am successfully using Prodigy and spaCy in my projects, and I have 25k domain-specific documents for one particular problem.

prodigy train ner dataset1 en_vectors_web_lg --output dataset1_Model --n-iter 10 --eval-split 0.2 --dropout 0.2
Best F-Score 85.036

prodigy train ner dataset1 blank:en --output dataset1_Model --n-iter 10 --eval-split 0.2 --dropout 0.2
Best F-Score 84.829

I ran prodigy train-curve ner:
50% 94.44 +0.95
75% 94.61 +0.17
100% 94.79 +0.18

I was expecting more accuracy from the vectors model. Do you think accuracy will improve if I use spacy pretrain? Does pretraining help in my case?

One more thing: I am following the workflow below, but sometimes the accuracy between prodigy train and spacy train differs. Prodigy is a bit higher than spaCy. Is there any step missing before using spacy train?
prodigy ner.manual
prodigy train-curve
prodigy train ner
spacy convert
spacy train

This is difficult to answer because it depends on your use case, the annotated data and how consistent it is, the quality of the vectors and how well they cover your data, what you'd be pretraining on, how you're evaluating your model, and so on. These are all aspects that matter and that you probably want to look into.

How are you evaluating the models when you use spaCy vs. Prodigy? If you're not using a dedicated evaluation set, Prodigy will just hold back a random sample of examples for evaluation. So your results may differ if you're evaluating on different examples.
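If you want both runs to be scored on exactly the same examples, the easiest fix is a dedicated evaluation set that you create once and reuse everywhere. For instance, you could split a db-out export with a fixed seed, roughly like this (just a sketch – the file names here are placeholders):

```python
import json
import random

# Annotations exported with: prodigy db-out dataset1 > dataset1.jsonl
with open("dataset1.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

# Shuffle once with a fixed seed so the split stays reproducible
random.seed(0)
random.shuffle(examples)

# Hold back 20% as a dedicated evaluation set
cutoff = int(len(examples) * 0.8)
with open("train.jsonl", "w", encoding="utf8") as f:
    f.write("\n".join(json.dumps(eg) for eg in examples[:cutoff]))
with open("eval.jsonl", "w", encoding="utf8") as f:
    f.write("\n".join(json.dumps(eg) for eg in examples[cutoff:]))
```

You can then load the two files into separate datasets with prodigy db-in and always evaluate against the same held-out examples, whether you train with Prodigy or spaCy.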

To convert your data over to spaCy, you probably want to use Prodigy's new data-to-spacy command, which will merge annotations on the same text and output data in spaCy's format.
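data-to-spacy does that merging for you, but just to illustrate what "merge annotations on the same text" means, here's a rough sketch of the idea over a db-out export (the _input_hash, spans and answer fields come from Prodigy's JSONL format):

```python
import json
from collections import defaultdict

# Annotations exported with: prodigy db-out dataset1 > dataset1.jsonl
merged = defaultdict(lambda: {"text": None, "spans": []})

with open("dataset1.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        # Examples with the same input hash refer to the same text
        key = eg["_input_hash"]
        merged[key]["text"] = eg["text"]
        merged[key]["spans"].extend(eg.get("spans", []))

# spaCy v2's "simple training style": (text, {"entities": [(start, end, label)]})
# (a real merge would also deduplicate identical spans)
train_data = [
    (eg["text"], {"entities": [(s["start"], s["end"], s["label"]) for s in eg["spans"]]})
    for eg in merged.values()
]
```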

If I want to use pretrain, these are the commands I am planning to use:

python -m spacy pretrain full_raw_data.jsonl en_vectors_web_lg ./spacy_pretrained_model

python -m prodigy train ner full_dataset en_vectors_web_lg --output full_dataset_model --init-tok2vec modelXXX.bin

Am I right?

Also, since the start I have always used the CLI commands rather than programmatic training. I am more comfortable setting params on the CLI, so I go to production with the models output by the train command. Do I miss anything by using the CLI?

Yes, that looks correct to me. And we'd definitely recommend using the CLI for training. If you want more options, you might want to use prodigy data-to-spacy and then use spaCy's spacy train command instead.

Thanks a lot for the information. I guess the data-to-spacy command will randomize and shuffle as well, right?

What is the best way of doing error analysis in NER? I am getting 82% on one label, which is a really important label for my project. I want to see where that missing 18% is and why the model can't identify that label, and in which docs that label isn't being predicted correctly. There must be something in those docs. Can you please point me in the right direction for that?

If you use data-to-spacy to create both a training and evaluation set, it will shuffle the data before splitting, yes.

A first and very basic approach would be to just run your model over your evaluation data and look at the examples it gets wrong. Sometimes this can already give you important clues – maybe some specific examples of that label are underrepresented in your training data. Maybe there are edge cases you haven't considered. You also want to make sure that your evaluation data is consistent and representative, so the results you're looking at are actually meaningful.
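For instance, you could load the trained model, run it over the held-out examples and print every example where the predictions don't match your annotations, roughly like this (a minimal sketch – dataset1_Model is the model trained above and eval.jsonl is whatever evaluation set you hold back):

```python
import json
import spacy

# Model produced by prodigy train (the path is a placeholder)
nlp = spacy.load("dataset1_Model")

with open("eval.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

for eg in examples:
    if eg.get("answer") != "accept":
        continue
    doc = nlp(eg["text"])
    gold = {(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])}
    pred = {(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents}
    if gold != pred:
        print(eg["text"])
        print("  missing: ", gold - pred)   # annotated but not predicted
        print("  spurious:", pred - gold)   # predicted but not annotated
```

Looking through the "missing" entities for your important label is usually the quickest way to spot patterns, e.g. particular document sections or phrasings the model keeps getting wrong.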

The main problem with that label is that it covers longer text, more than 100 chars, and sometimes contains another label's text as well. Will that be an issue?

Also, I want to try the spacy evaluate CLI command. Does it help in overseeing that label?

Yes, potentially – it sounds like what you're annotating and training here might be a bit different from what's typically considered named entity recognition, e.g. named "real world objects" like proper nouns, where boundaries are important. A single token can also only be part of one entity. Those are both things that NER model implementations are designed around, so if your annotations are different, that might explain why you're not seeing good results.

This comment has more background on this:
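As for spacy evaluate and keeping an eye on that one label: the per-label scores are usually more informative than the overall number. Recent spaCy v2 versions report per-entity-type precision, recall and F-score via the Scorer, which you can also compute yourself from the exported annotations (a sketch, assuming spaCy v2.x – the model path and file name are placeholders):

```python
import json
import spacy
from spacy.gold import GoldParse  # spaCy v2.x
from spacy.scorer import Scorer

nlp = spacy.load("dataset1_Model")
scorer = Scorer()

with open("eval.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        gold = GoldParse(nlp.make_doc(eg["text"]), entities=entities)
        scorer.score(nlp(eg["text"]), gold)

# Per-label precision/recall/F-score (included in newer v2 Scorer versions)
for label, scores in scorer.scores.get("ents_per_type", {}).items():
    print(label, scores)
```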

Examples:

**Nikos is a longtime The Head of Marketing,California at ABCD, has been appointed as the company’s new CMO at California office.**

designation = The Head of Marketing,California
location = California
company = ABCD

There is no clear boundary for the designation text, so I have to highlight the entire text, and the location ends up as part of the designation. I'd say about 30% of the annotation data is like that, and it's a data issue. Either I ignore those docs or go with them as they are.
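One way to put a number on that "about 30%" guess is to scan the exported annotations for spans whose text contains the text of a span with another label, roughly like this (a rough sketch over a db-out export):

```python
import json

# Annotations exported with: prodigy db-out full_dataset > full_dataset.jsonl
flagged = 0
total = 0

with open("full_dataset.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") != "accept":
            continue
        total += 1
        spans = [(eg["text"][s["start"]:s["end"]], s["label"]) for s in eg.get("spans", [])]
        # e.g. a designation like "The Head of Marketing,California" that
        # contains the text of a separately annotated location ("California")
        hits = [
            (outer_text, inner_label)
            for outer_text, outer_label in spans
            for inner_text, inner_label in spans
            if outer_label != inner_label and inner_text in outer_text
        ]
        if hits:
            flagged += 1

print(f"{flagged} of {total} accepted examples have one label's text inside another label's span")
```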

Depending on the context and position in the document, the model predicts fine in some of the cases, giving the full value "The Head of Marketing,California". But overall accuracy is low, 82%. I haven't tested the model fully to find the missing ones and edge cases as you said. I need to.

Does what I am saying here make sense?