'Cannot find label in model' when trying to train from pre-annotated data

usage
ner
solved

#1

I used ‘ner.manual’ to annotate custom labels in my text, for 10 rows of data. Now I want to train a model and use ‘ner.teach’ so that Prodigy picks up those labels and suggests entities for me to accept or reject, which in turn should make my annotation process easier.

Following the comments in the thread ‘Does Prodigy load pre-annotated data?’, I used these commands to train the model:

# import your data into Prodigy
prodigy db-in your_dataset_name /path/to/your_data.jsonl

# pre-train a model with your data
prodigy ner.batch-train your_dataset en_core_web_sm --output /path/to/output-model --no-missing

# improve the model interactively
prodigy ner.teach your_new_dataset /path/to/output-model --label EQUIPMENT

However, the model does not pick up my custom labels during training and throws the following error:

**Error: Can't find label 'Testlabel1' in model**

Below is the data format I received using ‘db-out’ command.
(Note: Below is just a sample chunk from my large JSONL file. The numbers and text are not the real values; I just wanted to highlight the data format.)

    {'text': 'example sentences and not the actual data, just random text here and labels.',
  '_input_hash': -2143034943,
  '_task_hash': -191726669,
  'tokens': [{'text': 'example', 'start': 0, 'end': 9, 'id': 0},
   {'text': 'of', 'start': 810, 'end': 812, 'id': 136},
   {'text': 'street', 'start': 813, 'end': 819, 'id': 137},
   {'text': 'City', 'start': 820, 'end': 824, 'id': 138},
   {'text': '.', 'start': 824, 'end': 825, 'id': 139}],
  'spans': [{'start': 780,
    'end': 788,
    'token_start': 130,
    'token_end': 131,
    'label': '“testlabel1'},
   {'start': 798,
    'end': 824,
    'token_start': 134,
    'token_end': 138,
    'label': 'testlabel2”'}],
  'answer': 'accept'},

(Ines Montani) #2

Hi! Your workflow sounds good 🙂 I just had a look at your example and the data and noticed 2 things:

Is this an exact copy-pasted excerpt from your data? Because the string labels seem to contain double quotes at the beginning and the end. Is it possible that you accidentally included them in your label set when you passed them in on the command line?

In that case, your model would have learned about “testlabel1 instead of testlabel1. This isn’t very tragic, though, because you can still run a quick search and replace and re-import your data to a new dataset.

I’m not 100% sure but I think labels are case-sensitive. So testlabel1 and Testlabel1 may be considered different labels. In general, we recommend using uppercase labels – this is a convention in many treebanks and also a convention within spaCy. So in your case, you could use TESTLABEL1 or even TESTLABEL_ONE.
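The search-and-replace mentioned above could be sketched as follows. This is only an illustration, assuming the annotations were exported with db-out as valid JSONL; the helper names (`clean_label`, `clean_record`, `clean_jsonl`) are made up for this example and are not part of Prodigy.

```python
import json

# Quote characters that can end up inside a label by accident
# (straight quotes plus the curly variants seen in the export above).
STRAY_QUOTES = "\"'\u201c\u201d\u2018\u2019"

def clean_label(label):
    """Strip whitespace and stray quotes, then uppercase the label."""
    return label.strip().strip(STRAY_QUOTES).strip().upper()

def clean_record(record):
    """Return a copy of one annotation record with every span label cleaned."""
    cleaned = dict(record)
    cleaned["spans"] = [
        {**span, "label": clean_label(span["label"])}
        for span in record.get("spans", [])
    ]
    return cleaned

def clean_jsonl(lines):
    """Yield rewritten JSONL lines; blank lines are dropped."""
    for line in lines:
        if line.strip():
            yield json.dumps(clean_record(json.loads(line)), ensure_ascii=False)

# Hypothetical one-line export with a stray opening curly quote in the label.
example = '{"text": "x", "spans": [{"start": 0, "end": 1, "label": "\u201ctestlabel1"}], "answer": "accept"}'
for line in clean_jsonl([example]):
    print(line)
```

The cleaned lines can then be written to a new file and imported into a fresh dataset with db-in.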


#3

@ines Oh yeah, this is the exact command I used to do manual annotation:

prodigy ner.manual testname en_core_web_sm testname.jsonl --label “Test_One_Ok, Test”.

For the second part you mentioned, I will have to change the labels to all UPPERCASE and test it again.
One more question: is the data format as expected? Once I complete the manual annotation, can I just use ‘db-out’, save the output and feed it to the model? My assumption was that I might have to keep only the ‘text’ and ‘spans’ and pass those to the model, like below:

{'text': 'example sentences and not the actual data, just random text here and labels.',
  '_input_hash': -2143034943,
  '_task_hash': -191726669,
    'spans': [{'start': 780,
    'end': 788,
    'token_start': 130,
    'token_end': 131,
    'label': '“testlabel1'},
   {'start': 798,
    'end': 824,
    'token_start': 134,
    'token_end': 138,
    'label': 'testlabel2”'}],
  'answer': 'accept'},

(Ines Montani) #4

One quotation mark on each side, like "label1, label2", should be fine on the command line. But if you accidentally pass in doubled quotation marks like ""label1, label2"", you’d end up with the labels "label1 and label2" (each keeping one stray quote), since the string you pass in is split on commas. You might also want to double-check that the quotation marks are actually plain quotation marks and not different unicode characters like “ or ”.
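A small sketch of why stray quotes survive the split. The function `split_labels` is hypothetical and only mimics the "split on commas" behaviour described above; it is not Prodigy's actual code.

```python
def split_labels(raw):
    """Split a comma-separated label string and trim whitespace (assumed behaviour)."""
    return [part.strip() for part in raw.split(",")]

# What the recipe would see if the shell already removed the quotes.
print(split_labels("LABEL1, LABEL2"))

# What it would see if literal quote characters reached it.
print(split_labels('"label1, label2"'))

# Curly quotes are ordinary characters to the shell, so they always get through.
print(split_labels("\u201ctest1, test2\u201d"))
```

In the last two cases the first label keeps its opening quote and the last label keeps its closing quote, which matches the labels seen in the exported data.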

If you want to train a model in Prodigy, you don’t even need to export the data. You can just give ner.batch-train the name of the dataset containing your annotations.

For NER, the "text" and "spans" properties (and usually the hashes) are the minimum information needed. But it’s no problem if there’s other information in your data. You can even add your own custom metadata that will be passed through (like an internal ID or something).
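As a rough illustration of that minimum, the sketch below checks that a record has a "text" string and that each span carries "start", "end" and "label" with offsets that fall inside the text. The function name `check_ner_record` and the exact checks are assumptions made for this example, not something Prodigy provides.

```python
def check_ner_record(record):
    """Return a list of problems found in one NER annotation record."""
    text = record.get("text")
    if not isinstance(text, str):
        return ["missing or non-string 'text'"]
    problems = []
    for i, span in enumerate(record.get("spans", [])):
        missing = [key for key in ("start", "end", "label") if key not in span]
        if missing:
            problems.append(f"span {i}: missing {missing}")
        elif not 0 <= span["start"] < span["end"] <= len(text):
            problems.append(f"span {i}: offsets outside the text")
    return problems

# Extra keys such as a custom "meta" entry are simply ignored by the check.
good = {"text": "Main street City.", "spans": [{"start": 12, "end": 16, "label": "TONE"}],
        "meta": {"internal_id": 1}}
bad = {"text": "short", "spans": [{"start": 780, "end": 788, "label": "TONE"}]}
print(check_ner_record(good))
print(check_ner_record(bad))
```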

You can also export a dataset to a JSONL file and then load that JSONL file back in to annotate it further. The input and output format are fully compatible. For example, you might want to correct existing manual NER annotations, or add another new label to the data.


#5

@ines Sure, thanks for the info. I am sure I passed only one pair of quotes (“test1, test2”). To avoid this confusion, should I define the labels without quotes? I’m not sure whether the quotes are required syntax or whether I can leave them out.


(Ines Montani) #6

Yes, you can also just write it like LABEL1,LABEL2. I usually prefer that syntax, too. The quotes are just a way to tell the command line script that something is one string. If you put --label LABEL1, LABEL2, LABEL2 would be interpreted as the next argument, which is obviously not what we want.
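To see how the shell would split these variants, Python's `shlex.split` can be used as an approximation of POSIX shell word splitting. The dataset and model names (`ds`, `model`) and the helper `label_value` are placeholders for this sketch.

```python
import shlex

def label_value(command):
    """Return the single token that would be passed as the value of --label."""
    argv = shlex.split(command)
    return argv[argv.index("--label") + 1]

# No space after the comma: one argument.
print(label_value("prodigy ner.teach ds model --label LABEL1,LABEL2"))

# Straight quotes group the value into one argument and are then removed.
print(label_value('prodigy ner.teach ds model --label "LABEL1, LABEL2"'))

# Unquoted with a space: only "LABEL1," belongs to --label.
print(label_value("prodigy ner.teach ds model --label LABEL1, LABEL2"))

# Curly quotes do not group anything and stay part of the value.
print(label_value("prodigy ner.teach ds model --label \u201cLABEL1, LABEL2\u201d"))
```

Only the first two forms deliver both labels as one value, which is why LABEL1,LABEL2 without spaces is the least error-prone way to write it.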


#7

@ines I tried changing the labels to UPPERCASE. These are the steps I followed:

prodigy db-in testone /home/user/sample.jsonl  

#In the above command, ‘testone’ is the one with annotations and sample.jsonl is the data I need to annotate

prodigy ner.batch-train testone en_core_web_sm --output /home/user --no-missing #Again 'testone' here

prodigy ner.teach sampletwo /home/user --label LABELONE,LABELTWO  #'sampletwo' is the new data

Using 2 labels: LABELONE,LABELTWO

Although I get the above message ‘Using 2 labels…’, nothing loads after that for a long time. Once I hit enter, I get the error below.

Traceback (most recent call last):
  File "cython_src/prodigy/components/loaders.pyx", line 117, in prodigy.components.loaders.JSONL
ValueError: Expected object or value

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 55, in prodigy.core.Controller.__init__
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/site-packages/toolz/itertoolz.py", line 368, in first
    return next(iter(seq))
  File "cython_src/prodigy/core.pyx", line 84, in iter_tasks
  File "cython_src/prodigy/components/sorters.pyx", line 136, in __iter__
  File "cython_src/prodigy/components/sorters.pyx", line 51, in genexpr
  File "cython_src/prodigy/models/ner.pyx", line 265, in __call__
  File "cython_src/prodigy/models/ner.pyx", line 233, in get_tasks
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "cython_src/prodigy/models/ner.pyx", line 192, in predict_spans
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "cython_src/prodigy/components/preprocess.pyx", line 36, in split_sentences
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/site-packages/spacy/language.py", line 548, in pipe
    for doc, context in izip(docs, contexts):
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/site-packages/spacy/language.py", line 572, in pipe
    for doc in docs:
  File "nn_parser.pyx", line 367, in pipe
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "nn_parser.pyx", line 367, in pipe
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "pipeline.pyx", line 431, in pipe
  File "cytoolz/itertoolz.pyx", line 1047, in cytoolz.itertoolz.partition_all.__next__
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/site-packages/spacy/language.py", line 746, in _pipe
    for doc in docs:
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/site-packages/spacy/language.py", line 551, in <genexpr>
    docs = (self.make_doc(text) for text in texts)
  File "/opt/anaconda/envs/documentparsing/lib/python3.6/site-packages/spacy/language.py", line 544, in <genexpr>
    texts = (tc[0] for tc in text_context1)
  File "cython_src/prodigy/components/preprocess.pyx", line 35, in genexpr
  File "cython_src/prodigy/components/filters.pyx", line 35, in filter_duplicates
  File "cython_src/prodigy/components/filters.pyx", line 16, in filter_empty
  File "cython_src/prodigy/components/loaders.pyx", line 22, in _rehash_stream
  File "cython_src/prodigy/components/loaders.pyx", line 125, in JSONL
ValueError: Failed to load task (invalid JSON).


  ...

#9
prodigy db-in testone /home/user/sample.jsonl 

Please ignore the explanation for this line in the above comments.
Note: In the above line, ‘sample.jsonl’ is the data I annotated using ner.manual (‘testone’ has the annotations of this data)


(Ines Montani) #10

Okay, so did you already figure this out or not?

ValueError: Failed to load task (invalid JSON).

This pretty much always means that there’s invalid JSON in the data and that loading one of the lines in your data failed. Maybe you accidentally included a typo when you replaced the label names? Maybe double-check for trailing commas, unescaped quotes and make sure all quotes are " and not ' (which isn’t allowed in JSON). You can also copy each line into a JSON validator and it should show you what’s wrong.
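Instead of pasting each line into a validator by hand, a short script can report which lines fail to parse. This is a minimal sketch using Python's standard `json` module; the function name `find_invalid_lines` is made up for this example, and blank lines are reported as invalid here on the assumption that every line should hold one JSON object.

```python
import json

def find_invalid_lines(lines):
    """Return (line_number, error_message) for each line that is not valid JSON."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, str(err)))
    return problems

sample = [
    '{"text": "valid line"}',
    "{'text': 'single quotes are not valid JSON'}",
    '{"text": "trailing comma",}',
]
for number, message in find_invalid_lines(sample):
    print(f"line {number}: {message}")
```

To check a real file, pass an open file handle instead of the list, for example `find_invalid_lines(open(path, encoding="utf8"))`, where `path` points to your JSONL file.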


#11

@ines

It still throws the same error. I checked the data set and it is in correct format. (Also, I am assuming prodigy wouldn’t load it in the first place if there is an error with JSON).

Note: I just manually annotated 2 rows of data just to test one complete flow

These are the steps I followed:

prodigy ner.manual testtwo en_core_web_sm dataset.jsonl --label TONE,TTWO
Using 2 labels: TONE, TTWO

  ✨  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

Saved 2 annotations to database SQLite
Dataset: testtwo

The second step is:

prodigy db-in testtwo /home/user/dataset.jsonl

  ✨  Imported 2 annotations for 'testtwo' to database SQLite
  Added 'accept' answer to 2 annotations

The third step being the training:

    prodigy ner.batch-train testtwo en_core_web_sm --output /home/user/backup --no-missing

Loaded model en_core_web_sm
Using 50% of accept/reject examples (1) for evaluation
Using 100% of remaining examples (6) for training
Dropout: 0.2  Batch size: 4  Iterations: 10  


BEFORE     0.000     
Correct    0
Incorrect  8
Entities   8         
Unknown    0         

         
#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY  
01         4.473      0          7          7          0          0.000                                                                     
02         3.552      0          7          7          0          0.000                                                                     
03         2.073      0          6          6          0          0.000                                                                     
04         2.221      0          5          5          0          0.000                                                                     
05         2.708      0          2          2          0          0.000                                                                     
06         3.315      0          0          0          0          0.000                                                                     
07         2.667      0          0          0          0          0.000                                                                     
08         2.250      0          0          0          0          0.000                                                                     
09         2.602      0          0          0          0          0.000                                                                     
10         3.555      0          0          0          0          0.000                                                                     

Correct    0
Incorrect  7
Baseline   0.000     
Accuracy   0.000     

Model: /home/user/backup
Training data: /home/user/backup/training.jsonl
Evaluation data: /home/user/backup/evaluation.jsonl

The fourth step is where I get the error:

prodigy ner.teach joboverviews.jsonl /home/user/backup --label TONE,TTWO
Using 2 labels: TONE, TTWO

Traceback (most recent call last):
  File "cython_src/prodigy/components/loaders.pyx", line 117, in prodigy.components.loaders.JSONL
ValueError: Expected object or value

(Ines Montani) #12

The first argument to ner.teach (the one right after the recipe name) is the name of the dataset you want to save your annotations to – not the file you want to load in. So I think you’re missing the dataset name here.

To see the arguments and docs of a recipe, you can always add --help btw. This will show you the arguments it takes and more information. For example:

prodigy ner.teach --help

You don’t actually need the db-in step (your second step)! When you annotate your data in the first step, the annotations are saved to the dataset testtwo. So they’re already in there and you don’t need to import anything else.

If you’re importing the raw, unannotated dataset.jsonl again on top of the already existing annotations, you’ll end up with duplicate examples in your dataset: one annotated version and one unannotated version. So when you train from that data, you likely won’t ever get good results.
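One way to spot this situation in an exported dataset is to look for texts that occur both with and without spans. The sketch below assumes the records were exported with db-out and loaded as dictionaries; `find_mixed_duplicates` is a hypothetical helper name.

```python
from collections import defaultdict

def find_mixed_duplicates(records):
    """Return texts that appear both with spans and without spans."""
    seen = defaultdict(set)
    for record in records:
        # Track, per text, whether an annotated and/or an unannotated copy exists.
        seen[record["text"]].add(bool(record.get("spans")))
    return [text for text, states in seen.items() if states == {True, False}]

records = [
    {"text": "Main street City.", "spans": [{"start": 12, "end": 16, "label": "TONE"}]},
    {"text": "Main street City."},  # raw copy imported again via db-in
    {"text": "Another sentence.", "spans": [{"start": 0, "end": 7, "label": "TTWO"}]},
]
print(find_mixed_duplicates(records))
```

Any text returned here has one annotated and one unannotated copy, so the unannotated copy is a candidate for removal before training.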


#13

@ines Sure, thanks for patiently answering all my questions.
Finally, I used this command, and Prodigy could load the data I need to annotate with the custom labels.

prodigy ner.teach testfour /home/user/backup newdata.jsonl --label TONE,TTWO

Here, ‘testfour’ will be the dataset to save the annotations and ‘newdata.jsonl’ is the data I need to annotate.