Missing entity result

Hello,
I trained a model using the transformer weights. Overall, the results were great. However, while training, one of my entities was dropped (I'm training on 3 labels and only 2 showed results). The label that didn't show results had the poorest representation. Is there a parameter that I might be missing in the config file?
label A: 1,000 tags
label B: 600 tags
label C: 300 tags

when I look at meta.json, only 2 results appear.

hi @kim_ds!

Interesting problem! (And thanks for joining the Prodigy Community :wave:)

Is this the approximate distribution of your tags? That is, you have ~1,000 annotations for label A, ~600 annotations for label B, etc.?

If not, can you provide the distributions?

Nothing stands out for now. I have a few questions:

  • How are you training: prodigy train or spacy train?

If you're using prodigy train, did you add the --label-stats argument to print the per-label stats after training? If so, please share the output :slight_smile:

  • Are you providing a custom spaCy config file? If so, can you provide details?

  • How are you setting your evaluation dataset?

If you're using prodigy train and you don't specify a dedicated holdout (eval) dataset, it will automatically create one for you. It's usually best practice to create a dedicated dataset and pass it in with the eval: dataset prefix, as shown below. I'm thinking there's a chance that if you created your own, there's an error in your holdout set (e.g., you forgot to include one of the labels during processing).
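For example, a minimal sketch (the dataset names my_ner_data and my_ner_eval are placeholders for your own Prodigy datasets):

python -m prodigy train ./output --ner my_ner_data,eval:my_ner_eval --label-stats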

  • If you're using spaCy projects (aka have a config.cfg file), can you run spacy debug data?

This will print helpful output, including the NER label details that spaCy is reading, like the example below.

python -m spacy debug data ./config.cfg
...

========================== Named Entity Recognition ==========================
ℹ 18 new labels, 0 existing labels
528978 missing values (tokens with '-' label)
New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL'
(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122),
'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC'
(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
...
  • How did you get the annotations? Did you use a Prodigy recipe or create them some other way?

I'm wondering whether, if they were created externally, there could have been an issue with the data formatting.
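If they were created externally, one quick sanity check is to count the span labels in the raw .jsonl export. Here's a minimal sketch (the filename annotations.jsonl is a placeholder for your export):

# Count entity labels in a Prodigy-style .jsonl export.
# "annotations.jsonl" is a placeholder for your file.
import json
from collections import Counter

label_counts = Counter()
with open("annotations.jsonl") as f:
    for line in f:
        eg = json.loads(line)
        # Prodigy NER annotations store entities under "spans"
        for span in eg.get("spans", []):
            label_counts[span["label"]] += 1

print(label_counts)  # all 3 labels should show up here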

Thank you!

thanks for the quick response!

I do get the following error when running your code above:

thinc.config.ConfigValidationError:

Config validation error

dev -> gold_preproc field required

{'readers': 'spacy.Corpus.v1', 'path': '/home/ec2-user/upper_OB_spacy/dev.spacy', 'max_length': 0}

When I change the max length, I still get the gold_preproc field required error.

Thank you so much for the quick response. This is already helpful.

So class A has 1,000 annotations for label A, class B has 600, and class C has 300. The annotations were created through Prodigy.

The command I'm running is:
python3 -m spacy train config.cfg --output ./output_upper_1000 --paths.train ./train.spacy --paths.dev ./dev.spacy --gpu-id 0

The config is as follows:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
readers = "spacy.Corpus.v1"
path = "/home/ec2-user/upper_OB_spacy/dev.spacy"
max_length = 0

[corpora.train]
@readers = "spacy.Corpus.v1"
path = "/home/ec2-user/upper_OB_spacy/train.spacy"
max_length = 0

Found the debug issue; this was helpful! Working through the rest now :slight_smile:

@ryanwesslen

My .jsonl file shows that I have almost 2,000 training documents; however, your command shows the following:

Language: en
Training pipeline: transformer, ner
21 training docs
5 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (21)

============================== Vocab & Vectors ==============================
ℹ 15593 total word(s) in the data (2690 unique)
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 2 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Low number of examples for label 'om' (44)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries

================================== Summary ==================================
✔ 6 checks passed
⚠ 1 warning
✘ 1 error

hi @kim_ds!

Glad we're making progress!

So it definitely looks like the problem is somewhere in the data, perhaps the formatting. Your .jsonl file has 2,000 docs, but spaCy is only reading 26 (21 for training / 5 for eval), so something is off in how spaCy reads your data.

I would recommend looking at the spaCy source code for the debug data command:

Maybe try to replicate the steps from that code on your dataset (e.g., how spaCy imports your data and concludes you only have 26 docs / 2 labels). By seeing that, hopefully you'll spot an issue with how your input data is formatted. Ultimately, you want the debug data step to read in all 2,000 docs + 3 labels.
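As a starting point, here's a minimal sketch (assuming the train.spacy / dev.spacy paths from your config) that counts the docs and entity labels spaCy actually reads from each corpus:

# Count docs and entity labels spaCy reads from each .spacy corpus.
import spacy
from spacy.tokens import DocBin
from collections import Counter

nlp = spacy.blank("en")
for path in ["./train.spacy", "./dev.spacy"]:
    docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))
    labels = Counter(ent.label_ for doc in docs for ent in doc.ents)
    print(f"{path}: {len(docs)} docs, labels: {dict(labels)}")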

You may also find the spaCy docs on training data to be helpful. Per the docs, NER training data should be in either of these two formats:

# Training data for an entity recognizer (option 1)
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp("Laura flew to Silicon Valley.")
gold_dict = {"entities": ["U-PERS", "O", "O", "B-LOC", "L-LOC", "O"]}
example = Example.from_dict(doc, gold_dict)

# Training data for an entity recognizer (option 2)
doc = nlp("Laura flew to Silicon Valley.")
gold_dict = {"entities": [(0, 5, "PERSON"), (14, 28, "LOC")]}
example = Example.from_dict(doc, gold_dict)
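If you want to sanity-check how an Example parsed (just an optional check, not required for training), you can print the aligned BILUO tags:

print(example.get_aligned_ner())
# e.g. ['U-PERSON', 'O', 'O', 'B-LOC', 'L-LOC', 'O']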

Hope this helps and let me know if you're able to find out any more details.

@ryanwesslen this was super helpful. My issue was a misuse of data-to-spacy. It's all going well now. Thanks!

That's great. Best of luck with your project, and let us know if you have further problems!
