How to overwrite/correct annotations?

I have 3000 examples and three categories. I’ve annotated the samples in a separate notebook with spaCy’s PhraseMatcher and Matcher. My labels are not 100% correct (maybe 70-80%), so I want to correct them in Prodigy, and once I have those corrected examples, I want to use ner.teach. But my question is:
How can I inspect and fix every single instance in Prodigy?

My dataset is a JSONL file in the following form (I’ve changed the actual texts):

{"text": "cancer type b. lorem ipsum.. ", "spans": [{"start": 14, "end": 48, "tokens_start": 3, "token_end": 6, "label": "DISORDER"}, {"start": 90, "end": 124, "tokens_start": 16, "token_end": 19, "label": "NEG_DISORDER"}, {"start": 170, "end": 189, "tokens_start": 31, "token_end": 33, "label": "DISORDER"}]}
{"text": "sinus rhythm. since the previous measurement - no cancer is seen ", "spans": []}

When I’m using:

ner.manual my_dataset diseasemodel patientrecors.jsonl --label "DISORDER,NEG_DISORDER,UN_DISORDER"

it says “No tasks available.” in the app.

I could use prodigy ner.make-gold, but I’m not sure that’s the best way if I want to make sure all samples get labelled. Can Prodigy keep track of fully annotated examples?

Another question: if my model is initially trained with a single label (“DISORDER”), does it automatically add the two new labels to the model when I’m using make-gold or manual? (I’m using a blank spaCy NER model.)

Your workflow and data look correct, and using ner.manual is definitely the approach I would have suggested :+1: If you load in data in the same format, it should respect the already annotated spans.

Could you run the command with PRODIGY_LOGGING=basic and check if anything looks suspicious? And is there any data in my_dataset already? (If you've already annotated those examples, Prodigy will skip them by default.)
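
For example, to see what’s already in there, you could run prodigy stats my_dataset, or check via the database API in Python, something like:

from prodigy.components.db import connect

db = connect()  # connects to the database configured for your Prodigy install
examples = db.get_dataset("my_dataset")
print(len(examples))  # number of annotations already saved in that dataset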

Ehm, maybe it’s a stupid question, but how do I enable that debug logging on Windows (Anaconda)?

On Windows, you should be able to use set to define an environment variable. For example:

set PRODIGY_LOGGING=basic
python -m prodigy ner.manual ...
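
If you’re using PowerShell instead of cmd.exe, setting the variable should look something like this instead:

$env:PRODIGY_LOGGING = "basic"
python -m prodigy ner.manual ...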

This thread has some more details and examples of environment variables in Windows.

It says:

22:49:16 - RECIPE: Calling recipe 'ner.manual'
Using 3 labels: DISORDER, NEG_DISORDER, UN_DISORDER
22:49:16 - RECIPE: Starting recipe ner.manual
22:49:16 - RECIPE: Loaded model diseasemodel 
22:49:16 - RECIPE: Annotating with 3 labels
22:49:16 - LOADER: Using file extension 'jsonl' to find loader
22:49:16 - LOADER: Loading stream from jsonl
22:49:16 - LOADER: Rehashing stream
22:49:16 - CONTROLLER: Initialising from recipe
22:49:16 - VALIDATE: Creating validator for view ID 'ner_manual'
22:49:16 - DB: Initialising database SQLite
22:49:16 - DB: Connecting to database SQLite
22:49:16 - DB: Loading dataset 'my_dataset' (3111 examples)
22:49:16 - DB: Creating dataset '2019-02-27_22-49-16'
22:49:16 - DatasetFilter: Getting hashes for excluded examples
22:49:16 - DatasetFilter: Excluding 2638 tasks from datasets: my_dataset 
22:49:16 - CONTROLLER: Initialising from recipe

  ?  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

22:49:23 - GET: /project
22:49:23 - Task queue depth is 1
22:49:23 - Task queue depth is 1
22:49:23 - GET: /get_questions
22:49:23 - FEED: Finding next batch of questions in stream
22:49:23 - CONTROLLER: Validating the first batch for session: None
22:49:23 - PREPROCESS: Tokenizing examples
22:49:23 - FILTER: Filtering duplicates from stream
22:49:23 - FILTER: Filtering out empty examples for key 'text'
22:49:30 - RESPONSE: /get_questions (0 examples)

Thanks! And okay, this definitely shows that after loading the file and excluding existing annotations, Prodigy ends up with 0 examples in the stream. Assuming your JSONL file isn't empty, I think the most likely explanation lies here:

It seems like the dataset my_dataset already includes the examples you're looking to annotate, so Prodigy skips them because they were already answered.

Try using a different dataset name, like disease_ner or something similarly descriptive. In general, it's always best to use separate datasets for separate projects and experiments. If you export them later on or use them to train a model, you'll always know what's in each dataset, and you won't have mixed data from different sources and experiments, which can easily lead to confusing results later on.
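
For example, keeping the rest of your command the same:

python -m prodigy ner.manual disease_ner diseasemodel patientrecors.jsonl --label "DISORDER,NEG_DISORDER,UN_DISORDER"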

Hi @ines,
When I follow your method, the existing annotations from my JSONL file are not shown (i.e., I see the text but no annotations). Would it be possible to view and correct the annotations from my JSONL file?

Okay, I realized that I was having that issue because I was also passing a patterns JSONL file via --patterns while continuing the correction.
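
So to view and correct the spans that are already in the JSONL file, I just run ner.manual without --patterns, roughly like this (the dataset, model, file and label names here are placeholders for my own):

python -m prodigy ner.manual my_corrections en_core_web_sm my_annotated_data.jsonl --label "DISORDER,NEG_DISORDER"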

Hope this is helpful for someone else :slight_smile: