Issue with HF NER training

Hello! When I use hf.train.ner on a dataset labelled entirely within Prodigy with 6 labels, training never starts and fails during the initialization phase:

Traceback (most recent call last):
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy\", line 50, in <module>
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy\", line 44, in main
    controller = run_recipe(run_args)
  File "cython_src\\prodigy\\cli.pyx", line 123, in prodigy.cli.run_recipe
  File "cython_src\\prodigy\\cli.pyx", line 124, in prodigy.cli.run_recipe
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy_hf\", line 182, in hf_train_ner
    gen_train, gen_valid, label_list, id2lab, lab2id = into_hf_format(train_examples, valid_examples)
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy_hf\", line 67, in into_hf_format
    valid_out = list(generator(valid_examples))
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy_hf\", line 56, in generator
    ner_tags[span['token_start']] = label2id[f"B-{span['label']}"]
KeyError: 'B-housenumber'

It seems like the dataset is coherent, since it trains normally with prodigy.train. I thought it might be an issue with my environment (bad install of transformers, etc.), but the error seems to be occurring within Prodigy. Any thoughts?

Hi @lamaeldo,

The KeyError you are seeing comes from the plugin, not the main Prodigy library. The plugin's code is open source, so you can clone it, modify it for debugging, and re-install it in your local virtual env by running:

python -m pip install -e {path to prodigy-hf source code}

What I think is happening here is that the label names differ between the training set and the evaluation set. Concretely, there is at least one label in the evaluation set that is not present in the training set.

If you look around the line where the error happens, you'll see that the label2id dictionary is built only from the training examples. However, it is also used to encode the evaluation set.
In other words, the function assumes that the labels are the same in the training set and in the evaluation set.
To confirm this, you could add the following lines in place of line 47:

train_label_names = get_label_names(train_examples)
valid_label_names = get_label_names(valid_examples)
assert set(train_label_names) == set(valid_label_names), "Label names in train and valid sets don't match."

I would expect this assert statement to fail.
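
The same failure can be reproduced outside the plugin. Below is a minimal sketch, assuming examples are plain dicts with a "spans" list as in Prodigy's NER format. The helpers build_label2id and missing_labels are hypothetical names written for illustration, not functions from prodigy-hf, and the toy data is made up.

```python
def build_label2id(examples):
    """Build a BIO label-to-id mapping from the spans of the given examples."""
    names = sorted({span["label"] for ex in examples for span in ex.get("spans", [])})
    labels = ["O"] + [f"{prefix}-{name}" for name in names for prefix in ("B", "I")]
    return {label: i for i, label in enumerate(labels)}


def missing_labels(train_examples, valid_examples):
    """Return span labels that occur in the validation set but not in training."""
    label2id = build_label2id(train_examples)
    valid_names = {s["label"] for ex in valid_examples for s in ex.get("spans", [])}
    return sorted(name for name in valid_names if f"B-{name}" not in label2id)


# Toy data (made up): "housenumber" appears only in the validation split.
train = [
    {"text": "Main Street", "spans": [{"label": "street", "token_start": 0, "token_end": 1}]}
]
valid = [
    {"text": "12 Main Street", "spans": [{"label": "housenumber", "token_start": 0, "token_end": 0}]}
]

print(missing_labels(train, valid))  # ['housenumber']
```

Any label this reports would trigger the KeyError above, because the lookup uses a mapping that was built from the training examples only.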

How do you pass the eval set to the recipe? Are you passing the eval dataset explicitly on the CLI via the eval: prefix, or do you provide the desired split (thus using the plugin's internal logic to divide the data into train and eval)?

Thanks for your help!
I have looked into this in more depth, and it seems this error has now somehow disappeared (and yes, both train and validation label lists are identical).
I now get the following error:

  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy_hf\", line 182, in hf_train_ner
    gen_train, gen_valid, label_list, id2lab, lab2id = into_hf_format(train_examples, valid_examples)
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy_hf\", line 66, in into_hf_format
    train_out = list(generator(train_examples))
  File "C:\Users\guill\AppData\Local\Programs\Python\Python310\lib\site-packages\prodigy_hf\", line 53, in generator
    tokens = [tok['text'] for tok in ex["tokens"]]

I have looked at the data that is passed to the generator function, and quite a few of my examples do not have a "tokens" key but do have a "spans" key, which is strange since I don't think I have ever added spans annotations to the dataset, only NER. I have tried to filter those out, to no avail.

Attached is the output when I print the first 10 elements of valid_examples on line 66 (.txt file uploaded as HTML to circumvent the upload restriction).

valid_examples.html (6.1 KB)

Hey @lamaeldo,

The reason it sometimes fails and sometimes does not is that the split is different on each run because the examples are shuffled, so sometimes both sets end up with examples for all labels and sometimes they don't. In any case, this means that some of your categories are probably not represented well enough. I would recommend analysing how many examples of each category there are and upsampling the ones that are poorly represented, or at least creating a dedicated eval set that you know contains all categories and passing it to the training script with the eval: prefix (also in the interest of being able to compare different experiments).
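
As a rough illustration of the per-category analysis suggested above, here is a minimal sketch that counts spans per label and reports labels a candidate eval set does not cover. It assumes examples are dicts with a "spans" list. The function names label_counts and uncovered_labels are hypothetical, and the toy data is made up.

```python
from collections import Counter


def label_counts(examples):
    """Count how many spans of each label occur across the examples."""
    counts = Counter()
    for ex in examples:
        for span in ex.get("spans", []):
            counts[span["label"]] += 1
    return counts


def uncovered_labels(all_examples, eval_examples):
    """Return labels present in the full dataset but absent from the eval set."""
    return sorted(set(label_counts(all_examples)) - set(label_counts(eval_examples)))


# Toy data (made up) standing in for an exported dataset.
examples = [
    {"spans": [{"label": "street"}, {"label": "housenumber"}]},
    {"spans": [{"label": "street"}]},
    {"spans": []},
]
eval_split = examples[1:]  # a split that happens to miss "housenumber"

print(label_counts(examples).most_common())    # [('street', 2), ('housenumber', 1)]
print(uncovered_labels(examples, eval_split))  # ['housenumber']
```

Labels with very low counts are the ones most likely to fall entirely on one side of a random split.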

On to your new error:
The hf recipe does indeed expect tokens to be present on every example. Looking at your data, you have a mix of binary NER annotations and manual NER annotations. The binary ones are the ones without tokens.
Prodigy saves NER annotations under the spans attribute, so it is not surprising that it's there. Both NER and spans annotations are stored as spans.
To be able to use your data with the hf recipe, you would have to do some preprocessing to add the tokens to the examples that don't have them. Prodigy has a helper for that: the add_tokens function. Just make sure you use the same model that was used to annotate the spans in your dataset.

import spacy
import srsly
from prodigy.components.preprocess import add_tokens
from prodigy.components.stream import get_stream

nlp = spacy.load(
)  # the pipeline used to create the existing NER annotations
stream = get_stream("your_dataset")
stream.apply(add_tokens, nlp=nlp, stream=stream)
retokenized_examples = [eg for eg in stream]
srsly.write_jsonl("my_retokenized.jsonl", retokenized_examples)

The new retokenized dataset ("my_retokenized.jsonl") should be usable with the hf recipe. Be mindful, though, of the train/eval split I mentioned above.
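
Before re-running the recipe, it may be worth checking that every example now has tokens and that the span indices line up with them. The following is a minimal sketch under the assumption that each example is a dict with "tokens" and "spans" lists and that spans carry "token_start" and "token_end" keys, as in the traceback above. The helpers find_untokenized and spans_aligned are hypothetical, and the toy data is made up.

```python
def find_untokenized(examples):
    """Return indices of examples with no "tokens" key or an empty token list."""
    return [i for i, ex in enumerate(examples) if not ex.get("tokens")]


def spans_aligned(example):
    """Check that every span's token indices fall inside the token list."""
    n_tokens = len(example.get("tokens", []))
    return all(
        0 <= span.get("token_start", -1) <= span.get("token_end", -1) < n_tokens
        for span in example.get("spans", [])
    )


# Toy data (made up): the second example mimics a binary annotation without tokens.
examples = [
    {
        "text": "12 Main",
        "tokens": [{"text": "12", "id": 0}, {"text": "Main", "id": 1}],
        "spans": [{"label": "housenumber", "token_start": 0, "token_end": 0}],
    },
    {"text": "Main Street", "spans": [{"label": "street", "start": 0, "end": 11}]},
]

print(find_untokenized(examples))            # [1]
print([spans_aligned(ex) for ex in examples])  # [True, False]
```

In practice the examples could be loaded from the exported file with srsly.read_jsonl. An empty result from find_untokenized suggests the "tokens" lookup in the generator should no longer fail.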