How to overwrite/correct annotations?

I have 3000 examples and three categories. I’ve annotated the samples in a separate notebook with spaCy’s PhraseMatcher and Matcher. My labels are not 100% correct (maybe 70-80%), so I want to correct them in Prodigy, and once I have those corrected examples, I want to use ner.teach. But my question is:
How can I inspect and fix every single instance in Prodigy?

My dataset is a JSONL file in the following form (I’ve changed the actual texts):

{"text": "cancer type b. lorem ipsum.. ", "spans": [{"start": 14, "end": 48, "tokens_start": 3, "token_end": 6, "label": "DISORDER"}, {"start": 90, "end": 124, "tokens_start": 16, "token_end": 19, "label": "NEG_DISORDER"}, {"start": 170, "end": 189, "tokens_start": 31, "token_end": 33, "label": "DISORDER"}]}
{"text": "sinus rhythm. since the previous measurement - no cancer is seen ", "spans": []}

When I’m using:

ner.manual my_dataset diseasemodel patientrecors.jsonl --label "DISORDER,NEG_DISORDER,UN_DISORDER"

it says “No tasks available.” in the app.

I could use prodigy ner.make-gold, but I’m not sure that’s the best way if I want to make sure all samples get labelled. Can Prodigy keep track of fully annotated examples?

Another question: if my model is initially trained with a single label (“DISORDER”), does it automatically add the two new labels to the model when I’m using make-gold or manual? (I’m using a blank spaCy NER model.)

Your workflow and data look correct, and using ner.manual is definitely the approach I would have suggested :+1: If you load in data in the same format, it should respect the already annotated spans.

Could you run the command with PRODIGY_LOGGING=basic and check if anything looks suspicious? And is there any data in my_dataset already? (If you've already annotated those examples, Prodigy will skip them by default.)
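
For example, to see what’s already in there, you could run prodigy stats my_dataset, or check via the database API in Python, something like:

from prodigy.components.db import connect

db = connect()  # connects to the database configured for your Prodigy install
examples = db.get_dataset("my_dataset")
print(len(examples))  # number of annotations already saved in that dataset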

Ehm, maybe it’s a stupid question, but how do I enable that debug logging on Windows (Anaconda)?

On Windows, you should be able to use set to define an environment variable. For example:

set PRODIGY_LOGGING=basic
python -m prodigy ner.manual ...
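
If you’re using PowerShell instead of cmd.exe, setting the variable should look something like this instead:

$env:PRODIGY_LOGGING = "basic"
python -m prodigy ner.manual ...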

This thread has some more details and examples of environment variables in Windows.

It says:

22:49:16 - RECIPE: Calling recipe 'ner.manual'
Using 3 labels: DISORDER, NEG_DISORDER, UN_DISORDER
22:49:16 - RECIPE: Starting recipe ner.manual
22:49:16 - RECIPE: Loaded model diseasemodel 
22:49:16 - RECIPE: Annotating with 3 labels
22:49:16 - LOADER: Using file extension 'jsonl' to find loader
22:49:16 - LOADER: Loading stream from jsonl
22:49:16 - LOADER: Rehashing stream
22:49:16 - CONTROLLER: Initialising from recipe
22:49:16 - VALIDATE: Creating validator for view ID 'ner_manual'
22:49:16 - DB: Initialising database SQLite
22:49:16 - DB: Connecting to database SQLite
22:49:16 - DB: Loading dataset 'my_dataset' (3111 examples)
22:49:16 - DB: Creating dataset '2019-02-27_22-49-16'
22:49:16 - DatasetFilter: Getting hashes for excluded examples
22:49:16 - DatasetFilter: Excluding 2638 tasks from datasets: my_dataset 
22:49:16 - CONTROLLER: Initialising from recipe

  ?  Starting the web server at http://localhost:8080 ...
  Open the app in your browser and start annotating!

22:49:23 - GET: /project
22:49:23 - Task queue depth is 1
22:49:23 - Task queue depth is 1
22:49:23 - GET: /get_questions
22:49:23 - FEED: Finding next batch of questions in stream
22:49:23 - CONTROLLER: Validating the first batch for session: None
22:49:23 - PREPROCESS: Tokenizing examples
22:49:23 - FILTER: Filtering duplicates from stream
22:49:23 - FILTER: Filtering out empty examples for key 'text'
22:49:30 - RESPONSE: /get_questions (0 examples)

Thanks! And okay, this definitely shows that after loading the file and excluding existing annotations, Prodigy ends up with 0 examples in the stream. Assuming your JSONL file isn't empty, I think the most likely explanation lies here:

It seems like the dataset my_dataset already includes the examples you're looking to annotate, so Prodigy skips them because they were already answered.

Try using a different dataset name, like disease_ner or something similarly descriptive. In general, it's always best to use separate datasets for separate projects and experiments. If you export them later on or use them to train a model, you'll always know what's in each dataset, and you won't have mixed data from different sources and experiments, which can easily lead to confusing results later on.
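
For example, keeping the rest of your command the same:

python -m prodigy ner.manual disease_ner diseasemodel patientrecors.jsonl --label "DISORDER,NEG_DISORDER,UN_DISORDER"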

Hi @ines,
When I follow your method, the existing annotations from my JSONL file are not shown (i.e., I see the text but no annotations). Would it be possible to view and correct the annotations from my JSONL file?

Okay, I realized that I was having that issue because I was also passing a patterns JSONL file via --patterns while continuing the correction.
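
So to view and correct the spans that are already in the JSONL file, I just run ner.manual without --patterns, roughly like this (the dataset, model, file and label names here are placeholders for my own):

python -m prodigy ner.manual my_corrections en_core_web_sm my_annotated_data.jsonl --label "DISORDER,NEG_DISORDER"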

Hope this is helpful for someone else :slight_smile: