ner.teach not working

Hi all

I faced this issue:

✘ Error while validating stream: no first example
This likely means that your stream is empty. This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.

While running:

prodigy ner.teach sample $MODEL_FILE $DATA_FILE --label PersonName,Organisation,TeamId,Email,TelNumber,UserId --patterns $PATTERN_FILE

I added a patterns file because the model wasn't learning some entities in my taxonomy.

Could you help with this, please?

Thank you

Best regards
Julie

hi @JulieSarah,

Typically, this means there's nothing to load from the file. Can you provide an example of your data? You can remove any sensitive content. As in similar threads, the cause is often a small mistake in the input data.

Unfortunately, there are 44 issues that mention a similar error, so there could be a variety of causes.

Let me know if you can provide an example, and we can go from there. Also, if you can run with logging enabled, please share the output.

Here is a sample of my JSONL. With CSV I got the same issue.

> {'text': 'blabla "A" FAULT', 'Meta': 'id1'}
> {'text': 'bla', 'Meta': 'id2'}
> {'text': 'blablabla', 'Meta': 'id6'}

From the logging, I had at the end:

> 14:55:03: DB: Initializing database SQLite
> 14:55:03: DB: Connecting to database SQLite
> 14:55:03: DB: Creating dataset '2022-11-15_14-55-03'
> {'created': datetime.datetime(2022, 11, 15, 10, 53, 53)}
> 
> 14:55:03: FEED: Initializing from controller
> {'auto_count_stream': False, 'batch_size': 5, 'dataset': 'sample_part_installations_oso_compatible_manufacturer_part_description_filtered_teach', 'db': <prodigy.components.db.Database object at 0x7fc75a9121f0>, 'exclude': ['sample_part_installations_oso_compatible_manufacturer_part_description_filtered_teach'], 'exclude_by': 'task', 'max_sessions': 10, 'overlap': False, 'self': <prodigy.components.feeds.Feed object at 0x7fc75ae00a60>, 'stream': <generator object teach.<locals>.<genexpr> at 0x7fc75afea890>, 'target_total_annotated': 0, 'timeout_seconds': 3600, 'total_annotated': 0, 'total_annotated_by_session': Counter(), 'validator': <prodigy.components.validate.Validator object at 0x7fc75a912310>, 'view_id': 'ner'}
> 
> 14:55:03: PREPROCESS: Splitting sentences
> {'batch_size': 32, 'min_length': None, 'nlp': <spacy.lang.en.English object at 0x7fc75ae00dc0>, 'no_sents_warned': False, 'stream': <generator object at 0x7fc75afe75e0>, 'text_key': 'text'}
> 
> 14:55:03: CONFIG: Using config from global prodigy.json
> /local/home/ta-2f93-titan-jma/.prodigy/prodigy.json
> 
> 14:55:03: CONFIG: Using config from working dir
> /data/neras/neras_annotator/.prodigy.json
> 
> 14:55:03: FILTER: Filtering duplicates from stream
> {'by_input': True, 'by_task': True, 'stream': <generator object at 0x7fc75afe74a0>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7fc7740fc4c0>>, 'warn_threshold': 0.4}
> 
> 14:55:03: FILTER: Filtering out empty examples for key 'text'

Okay, yes - I had to modify your .jsonl a bit for formatting, but when I used this data:

{"text": "blabla \"A\" FAULT", "meta": {"id": "id1"}}
{"text": "bla", "meta": {"id": "id2"}}
{"text": "blablabla", "meta": {"id": "id6"}}

and ran this command (since I didn't have the model used in your recipe, I used en_core_web_sm as an alternative):

python -m prodigy ner.teach sample en_core_web_sm data/sample.jsonl --label PERSON 

I got this error:

✘ Error while validating stream: no first example
This likely means that your stream is empty. This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.

However, as the error says, I think the problem is that you're not getting any predicted entities or pattern matches, so the stream is empty.

For example, if I modify the data to:

{"text": "blabla \"A\" FAULT", "meta": {"id": "id1"}}
{"text": "Joe Biden is president of the United States.", "meta": {"id": "id2"}}
{"text": "blablabla", "meta": {"id": "id6"}}

It does load up one example.
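
As an aside, this is why I had to reformat your sample: JSONL requires strict JSON on each line, so the single-quoted, capitalized-`Meta` lines don't parse. A quick sanity check with Python's standard `json` module shows the difference (the two example strings here are just illustrations):

```python
import json

# valid JSONL line: double quotes, escaped inner quotes, lowercase "meta" object
good = '{"text": "blabla \\"A\\" FAULT", "meta": {"id": "id1"}}'
# original format: single quotes are not valid JSON
bad = "{'text': 'blabla \"A\" FAULT', 'Meta': 'id1'}"

print(json.loads(good)["meta"]["id"])  # prints: id1
try:
    json.loads(bad)
except json.JSONDecodeError as err:
    print("invalid JSON:", err.msg)
```

Running a check like this over each line of your file is a quick way to rule out formatting problems before blaming the recipe.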

I suspect your problem is that not enough of the predictions on your data are meeting the active learning criteria. Ignoring patterns, ner.teach is essentially just running this:

from prodigy.components.sorters import prefer_uncertain

def score_stream(stream):
    # "model" stands for the recipe's NER model; each example gets a score
    for example in stream:
        score = model.predict(example["text"])
        yield (score, example)

# the sorter decides which scored examples are actually asked about
stream = prefer_uncertain(score_stream(stream))

When you run this, prefer_uncertain isn't returning any of your predictions from your stream because of the default algorithm, ema. It tracks the exponential moving average of the uncertainties, along with a moving variance, and then only asks questions that are one standard deviation or more above the current average uncertainty.
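
To make that concrete, here's a minimal pure-Python sketch of the ema idea described above. This is an illustration of the mechanism, not Prodigy's actual implementation, and the score-to-uncertainty mapping is an assumption for the example:

```python
import math

def ema_sorter(scored_stream, alpha=0.1):
    """Sketch of an EMA-based uncertainty sorter: track an exponential
    moving average and variance of the uncertainty, and only yield
    examples at least one standard deviation above the average."""
    avg = None   # exponential moving average of uncertainty
    var = 0.0    # exponential moving variance
    for score, example in scored_stream:
        # a score of 0.5 is maximally uncertain; map scores to [0, 1]
        uncertainty = 1.0 - abs(score - 0.5) * 2
        if avg is None:
            avg = uncertainty
        else:
            delta = uncertainty - avg
            avg += alpha * delta
            var = (1 - alpha) * (var + alpha * delta * delta)
        if uncertainty >= avg + math.sqrt(var):
            yield example
```

Note what happens if all your scores cluster together (e.g. the model is uniformly confident on every span): almost nothing clears the one-standard-deviation threshold, which matches the empty-stream behaviour you're seeing.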

What you may want to do is change the algorithm (or the bias) used by prefer_uncertain.

Here's a bit more background:

FYI: to find your local recipes, run python -m prodigy stats, look for the Location: path, and open recipes/ner.py there; that's where ner.teach is defined. You can then either modify it directly, or copy it to create your own modified ner.teach recipe to experiment with.

Hope this helps!

Thank you @ryanwesslen. So you're saying that if I modify the prefer_uncertain function, I won't get this error even if there are no entities in my JSONL?

I don't understand whether I got the error because none of my JSONL lines contain a single entity, or just because the first line has none.

Maybe a workaround would be to filter the sample to make sure it contains entities. What would you recommend for that in terms of a Prodigy workflow?

Thank you

Julie

So I can't guarantee you won't -- but as I mentioned, the current default algorithm is ema:

  • prefer_uncertain(stream, algorithm='ema'): This is the default sorter. It tracks the exponential moving average of the uncertainties, and also tracks a moving variance. It then asks questions which are one standard deviation or more above the current average uncertainty.

Therefore, it will only "ask questions" about spans that are one or more standard deviations above the current average uncertainty. I think you're not seeing any examples because none of your spans meets this criterion. Hence my recommendation to modify it.

Yes, you could, but I would just use ner.correct, which lets you accept or fix the model's predictions directly. Remember, ner.teach is about modifying the order of the examples you see (active learning); if you don't want to use it, that's completely fine. You're welcome to filter out examples without entities, but keep in mind you do need negative examples (i.e., examples without any entities). If you label only those with predicted entities, you may not end up with as robust a model.
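
If you do want to pre-filter, here's a rough sketch of the idea. The get_entities callable is hypothetical; in practice you'd wrap your spaCy model, e.g. lambda text: nlp(text).ents. It keeps a random fraction of the negatives so the model still sees empty examples:

```python
import random

def filter_with_negatives(stream, get_entities, keep_negative=0.25, seed=0):
    """Yield every example with at least one predicted entity, plus a
    random fraction (keep_negative) of the examples with none."""
    rng = random.Random(seed)  # seeded for reproducibility
    for example in stream:
        if get_entities(example["text"]):
            yield example
        elif rng.random() < keep_negative:
            yield example
```

You'd run this over your JSONL stream before feeding it to the recipe; tune keep_negative to control how many entity-free examples survive.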