Missing entity result

Hello,
I trained a model using the transformer weights. Overall, the results were great. However, while training, one of my entities was dropped (I'm training on 3 labels and only 2 showed results). The label that didn't show results had the poorest representation. Is there a parameter that I might be missing in the config file?
label A: 1,000 tags
label B: 600 tags
label C: 300 tags

when I look at meta.json, only 2 results appear.

hi @kim_ds!

Interesting problem! (And thanks for joining the Prodigy Community :wave:)

Is this the approximate distribution of your tags? That is, you have ~1,000 annotations for label A, ~600 annotations for label B, etc.?

If not, can you provide the distributions?

Nothing stands out for now. I have a few questions:

  • How are you training: prodigy train or spacy train?

If you're using prodigy train, did you add the --label-stats argument to print the per-label stats after training? If so, please share the output :slight_smile:

  • Are you providing a custom spaCy config file? If so, can you provide details?

  • How are you setting your evaluation dataset?

If you're using prodigy train and you don't specify a dedicated holdout (eval) dataset, it will automatically create one for you. It's usually best practice to create a dedicated dataset and pass it in with the eval: dataset prefix, as shown below. I'm thinking there's a chance that if you created your own, there's an error in your holdout set (e.g., you forgot to include one of the labels during processing).
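For example, a minimal sketch (the dataset names my_ner_data and my_ner_eval are placeholders for your own Prodigy datasets):

python -m prodigy train ./output --ner my_ner_data,eval:my_ner_eval --label-stats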

  • If you're using spaCy projects (aka have a config.cfg file), can you run spacy debug data?

This will print helpful output, including the NER label details that spaCy is reading, like the example below.

python -m spacy debug data ./config.cfg
...

========================== Named Entity Recognition ==========================
ℹ 18 new labels, 0 existing labels
528978 missing values (tokens with '-' label)
New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL'
(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122),
'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC'
(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
...
  • How did you get the annotations? Did you use a Prodigy recipe or create them some other way?

I'm wondering whether, if they were created externally, there could have been an issue with the data formatting.
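If they were created externally, one quick sanity check is to count the span labels in the raw .jsonl export. Here's a minimal sketch (the filename annotations.jsonl is a placeholder for your export):

# Count entity labels in a Prodigy-style .jsonl export.
# "annotations.jsonl" is a placeholder for your file.
import json
from collections import Counter

label_counts = Counter()
with open("annotations.jsonl") as f:
    for line in f:
        eg = json.loads(line)
        # Prodigy NER annotations store entities under "spans"
        for span in eg.get("spans", []):
            label_counts[span["label"]] += 1

print(label_counts)  # all 3 labels should show up here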

Thank you!

thanks for the quick response!

I do get the following error when running your code above:

thinc.config.ConfigValidationError:

Config validation error

dev -> gold_preproc field required

{'readers': 'spacy.Corpus.v1', 'path': '/home/ec2-user/upper_OB_spacy/dev.spacy', 'max_length': 0}

When I change the max length, I still get the gold_preproc field required error.

Thank you so much for the quick response. This is already helpful.

So class A has 1,000 annotations for label A, class B has 600, and class C has 300. The annotations were created through Prodigy.

The command I'm running is:
python3 -m spacy train config.cfg --output ./output_upper_1000 --paths.train ./train.spacy --paths.dev ./dev.spacy --gpu-id 0

The config is as follows:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
readers = "spacy.Corpus.v1"
path = "/home/ec2-user/upper_OB_spacy/dev.spacy"
max_length = 0

[corpora.train]
@readers = "spacy.Corpus.v1"
path = "/home/ec2-user/upper_OB_spacy/train.spacy"
max_length = 0

Found the debug issue; this was helpful! Working through the rest now :slight_smile:

@ryanwesslen

My .jsonl file shows that I have almost 2,000 training documents; however, your command shows the following:

Language: en
Training pipeline: transformer, ner
21 training docs
5 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (21)

============================== Vocab & Vectors ==============================
ℹ 15593 total word(s) in the data (2690 unique)
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 2 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Low number of examples for label 'om' (44)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries

================================== Summary ==================================
✔ 6 checks passed
⚠ 1 warning
✘ 1 error

hi @kim_ds!

Glad we're making progress!

So it definitely looks like the problem is somewhere in the data, perhaps the formatting. Your .jsonl file has 2,000 docs, but spaCy is only reading 26 (21 for training / 5 for eval), so something is off in how spaCy reads your data.

I would recommend looking at the spaCy source code for the debug data command:

Maybe try to replicate the steps from that code on your dataset (e.g., how spaCy imports your data and concludes you only have 26 docs / 2 labels). By seeing that, hopefully you'll spot an issue with how your input data is formatted. Ultimately, you want the debug data step to read in all 2,000 docs + 3 labels.
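As a starting point, here's a minimal sketch (assuming the train.spacy / dev.spacy paths from your config) that counts the docs and entity labels spaCy actually reads from each corpus:

# Count docs and entity labels spaCy reads from each .spacy corpus.
import spacy
from spacy.tokens import DocBin
from collections import Counter

nlp = spacy.blank("en")
for path in ["./train.spacy", "./dev.spacy"]:
    docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))
    labels = Counter(ent.label_ for doc in docs for ent in doc.ents)
    print(f"{path}: {len(docs)} docs, labels: {dict(labels)}")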

You may also find the spaCy docs on training data to be helpful. Per the docs, NER training data should be in either of these two formats:

# Training data for an entity recognizer (option 1)
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp("Laura flew to Silicon Valley.")
gold_dict = {"entities": ["U-PERS", "O", "O", "B-LOC", "L-LOC", "O"]}
example = Example.from_dict(doc, gold_dict)

# Training data for an entity recognizer (option 2)
doc = nlp("Laura flew to Silicon Valley.")
gold_dict = {"entities": [(0, 5, "PERSON"), (14, 28, "LOC")]}
example = Example.from_dict(doc, gold_dict)
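If you want to sanity-check how an Example parsed (just an optional check, not required for training), you can print the aligned BILUO tags:

print(example.get_aligned_ner())
# e.g. ['U-PERSON', 'O', 'O', 'B-LOC', 'L-LOC', 'O']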

Hope this helps and let me know if you're able to find out any more details.

@ryanwesslen this was super helpful. My issue was a misuse of data-to-spacy. It's all going well now. Thanks!

That's great. Best of luck with your project, and let us know if you have further problems!
