Evaluation dataset + patterns

Hi there,

I'm using ner.manual with patterns to recognize company names in text.

prodigy ner.manual ner_company_names nl_core_news_lg ./assets/raw_text.jsonl --label ORG,PERSON --patterns ./assets/company_name_patterns.jsonl

I have a few questions:

  1. Patterns: Can I edit the patterns file while the ner.manual server is running? If I change the file, do I need to restart the server and refresh the browser? And what effect does that have on the dataset? (Currently, I just restart the server and refresh the browser.)

  2. Evaluation dataset: My evaluation file contains a few thousand samples. Do I need to annotate it or run training on it, or should it stay as raw text?

prodigy ner.manual ner_company_names_eval nl_core_news_lg ./assets/raw_text_eval.jsonl --label ORG,PERSON

A raw text line looks like this:

{"text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam faucibus eros aliquam, laoreet magna et, tincidunt arcu.", "meta": {"source": "lipsun", "id": 4450141}}

Thanks in advance.

Hi @nikolaysm and welcome to the forum :wave:

Re. Patterns

It's true that you can't really update the patterns in the built-in NER recipe while the server is running. Restarting the server with the updated patterns file and the same target dataset will apply the updated patterns to the questions that haven't been saved to the dataset yet.
In other words, the new patterns won't be re-applied to the already saved examples.

One way around it is to write a custom NER workflow that leverages the new stream_reset feature and feeds the new patterns in interactively via a custom event.
You can see this in action in our ANN plugin, which uses custom events to modify the query over the indexed dataset. The source code of this solution is available here. It's, of course, not exactly the same problem, but maybe it can serve as an inspiration :slight_smile:
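For completeness: if the custom-events route feels like too much machinery, a lower-tech workaround is a custom recipe whose stream re-reads the patterns file for every incoming example, so saving the file takes effect on the next question without restarting the server. Here's a minimal sketch of that idea (not the plugin's approach; the recipe name ner.manual.live-patterns is made up for this example, it only handles token-list patterns, and it's written in the classic loader/generator style, so on Prodigy v1.12+ you may want get_stream and the Stream API instead):

import spacy
import srsly
import prodigy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


def make_matcher(nlp, patterns_path):
    # Rebuild the matcher from the file. Only token-list patterns
    # (e.g. {"label": "ORG", "pattern": [{"lower": "acme"}]}) are
    # handled here; string patterns would need a PhraseMatcher.
    matcher = Matcher(nlp.vocab)
    for entry in srsly.read_jsonl(patterns_path):
        if isinstance(entry["pattern"], list):
            matcher.add(entry["label"], [entry["pattern"]])
    return matcher


@prodigy.recipe(
    "ner.manual.live-patterns",
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline", "positional", None, str),
    source=("JSONL file with raw text", "positional", None, str),
    label=("Comma-separated label set", "option", "l", str),
    patterns=("Path to the patterns JSONL file", "option", "pt", str),
)
def ner_manual_live_patterns(dataset, spacy_model, source, label, patterns):
    nlp = spacy.load(spacy_model)

    def add_pattern_spans(stream):
        for eg in stream:
            # Re-read the patterns file for every example, so saving an
            # edited file takes effect on the next question served.
            matcher = make_matcher(nlp, patterns)
            doc = nlp.make_doc(eg["text"])
            matched = [Span(doc, start, end, label=match_id)
                       for match_id, start, end in matcher(doc)]
            eg["spans"] = [
                {
                    "start": span.start_char,
                    "end": span.end_char,
                    "token_start": span.start,
                    "token_end": span.end - 1,  # Prodigy's token_end is inclusive
                    "label": span.label_,
                }
                for span in filter_spans(matched)  # drop overlapping matches
            ]
            yield eg

    stream = add_tokens(nlp, add_pattern_spans(JSONL(source)))
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": label.split(",")},
    }

You'd run it like the built-in recipe, pointing -F at the file with the recipe:

prodigy ner.manual.live-patterns ner_company_names nl_core_news_lg ./assets/raw_text.jsonl --label ORG,PERSON --patterns ./assets/company_name_patterns.jsonl -F recipe.py

Re-reading the file on every example is wasteful for big pattern files, but for a few hundred patterns the overhead is negligible.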

Re. Evaluation dataset
The evaluation dataset should contain the right answers, so yes, it should be annotated.
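Once it's annotated, you can pass it to training as a dedicated evaluation set with the eval: prefix (assuming the prodigy train syntax of v1.11+):

prodigy train ./output --ner ner_company_names,eval:ner_company_names_eval --lang nl

This trains on ner_company_names and reports scores on ner_company_names_eval, instead of holding out a random split from the training data.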
