Datasets and using pre-annotated data

Many thanks for your responses. As a test, I loaded my raw data into Prodigy and added arbitrary labels (defined with --label in ner.manual). This showed that I am able to annotate raw data with an arbitrary set of labels. I used a JSON file containing the sentences of my data, converted it to JSONL, and everything worked fine.

Now back to my question: since I did the custom tokenization of my data with regex and also annotated the data with regex,
I have the annotated data in this format in Python:

[[('Therefore', 'None'),
('CD', 'GEOM'),
('being', 'None'),
('dropped', 'None'),
('perpendicular', 'None'),
('to', 'None'),
('AB', 'GEOM'),
('where', 'None'),
('AD', 'GEOM'),
('which', 'None'),
('is', 'None'),
('half', 'None'),
('AB', 'GEOM'),
('is', 'None'),
('1000', 'NUM'),
('AC', 'GEOM'),
('will', 'None'),
('be', 'None'),
('3333⅓', 'NUM')],
[('Looking', 'None'),
('this', 'None'),
('up', 'None'),
('in', 'None'),
('a', 'None'),
('table', 'None'),
('of', 'None'),
('secants', 'None'),
('we', 'None'),
('find', 'None'),
('the', 'None'),
('angles', 'None'),
('CAD', 'GEOM'),
('and', 'None'),
('CBD', 'GEOM'),
('to', 'None'),
('be', 'None'),
("72° 33'", 'COORD')],
[('So', 'None'),
('also', 'None'),
('at', 'None'),
('16°', 'ANG'),
('or', 'None'),
('17°', 'ANG'),
('Aquarius', 'None'),
('with', 'None'),
('AB', 'GEOM'),
('1000', 'NUM'),
('AC', 'GEOM'),
('is', 'None'),
('1375', 'NUM'),
('so', 'None'),
('if', 'None'),
('AD', 'GEOM'),
('1000', 'NUM'),
('AC', 'GEOM'),
('is', 'None'),
('2750', 'NUM'),
('showing', 'None'),
("68° 40'", 'COORD'),
('in', 'None'),
('the', 'None'),
('table', 'None'),
('of', 'None'),
('secants', 'None')]]

Do you have any suggestion for how I can proceed from here? I probably need to convert this into the format you mentioned. Is there any way I can use Prodigy for that?

Many thanks

Yes, this looks good – now you can write a small function that takes your tokens and outputs them as a dictionary with "text", "tokens" and "spans" fields. Do you still have the original text with whitespace? Otherwise, you’ll have to reconstruct that by concatenating the token texts.
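
For example, a rough (untested) sketch of such a conversion could look like the snippet below. It assumes the original text can be reconstructed by joining the tokens with single spaces and that the string 'None' is what you use for non-entity tokens – adjust both to your data.

from prodigy.util import write_jsonl

# your annotated sentences: a list of lists of (token_text, label) tuples
annotated_sents = []
NO_LABEL = "None"  # the label string you use for non-entity tokens

examples = []
for sent in annotated_sents:
    text = ""
    tokens = []
    spans = []
    for i, (word, label) in enumerate(sent):
        if text:
            text += " "  # assumption: tokens were separated by a single space
        start = len(text)
        end = start + len(word)
        text += word
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        if label != NO_LABEL:
            spans.append({"start": start, "end": end, "token_start": i,
                          "token_end": i, "label": label})
    examples.append({"text": text, "tokens": tokens, "spans": spans})

write_jsonl("data.jsonl", examples)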

Everyone’s raw data is different, so there’s no converter that takes exactly what you have and outputs JSON. But Prodigy standardises on a pretty straightforward format, so hopefully it shouldn’t be too difficult to write a function that converts your annotations in Python or any other programming language you like.

If there’s something you can automate (for example, with regex), you definitely want to take advantage of that! The more you can automate or pre-select, the better. This saves you time and reduces the potential for human error :slightly_smiling_face:

Here’s a quick example for a conversion script in Python – I haven’t tested it yet but something like this should work. You take a bunch of regular expressions, match them on all your texts, get the start and end character index and format them as "spans" in Prodigy’s format. At the end, you can export the data to a file data.jsonl.

import re
from prodigy.util import write_jsonl

label = "LABEL"   # whatever label you want to use
texts = []  # a list of your texts
regex_patterns = [
    # your expressions – whatever you need
    re.compile(r"(?:[0-9a-fA-F]{2}[-:]){5}(?:[0-9a-fA-F]{2})")
]

examples = []
for text in texts:
    spans = []  # collect the spans of all patterns for this text
    for expression in regex_patterns:
        for match in re.finditer(expression, text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": label}
            spans.append(span)
    # one task per text, containing all of its matched spans
    task = {"text": text, "spans": spans}
    examples.append(task)

write_jsonl("data.jsonl", examples)
4 Likes

Now I solved it by regex! :)
I have three versions of the data, annotated with three different labels (they will be revised by the annotator on Thursday).

Before the editing by the annotator, I used the following (is it correct?):

python -m prodigy db-in ner_date_01 NER_date_01.jsonl


python -m prodigy ner.batch-train NER_ASTR_01.jsonl en_core_web_sm --ASTR --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2

I got this kind of result:

BEFORE     0.500
Correct    10
Incorrect  10
Entities   2768
Unknown    2758


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         11.611     18         2          1596       0          0.900
02         8.178      18         2          1456       0          0.900
03         5.282      19         1          889        0          0.950
04         4.834      18         2          1117       0          0.900
05         3.829      18         2          1047       0          0.900
06         3.615      20         0          947        0          1.000

Could you let me know whether I am going in the right direction?

Should I do this again after the editing by the annotator?

What is the next step? Should I merge the three datasets (with different labels) into one?

I am very excited now :slight_smile:

Best

Nice to hear! And yes, this looks good! The results you're seeing right now (90%-100% accuracy) are a bit misleading – it's because you do not have any negative examples. Once you're done with correcting the annotations and have the full dataset you want to train on, you can add --no-missing to ner.batch-train. This will treat all unlabelled tokens as "outside of an entity" (instead of missing values) and give you more reliable results.
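
For example, roughly like this (adjust the dataset name and paths to yours):

python -m prodigy ner.batch-train ner_date_01 en_core_web_sm --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing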

Yes, exactly! If you're using the latest Prodigy v1.8, you can also use the db-merge command to merge datasets automatically.
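
For example, something like this should combine the annotations into a new dataset (assuming your datasets are called ner_date_01, ner_astr_01 and ner_time_01 – substitute your actual dataset names):

python -m prodigy db-merge ner_date_01,ner_astr_01,ner_time_01 ner_all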

Good luck!

1 Like

Dear Ines,

Thank you for your responses. So basically the annotator should correct each label, then she should send me the JSONL file, and then I should run:

python -m prodigy db-in ner_date_01 NER_date_01.jsonl


python -m prodigy ner.batch-train NER_ASTR_01.jsonl en_core_web_sm --ASTR --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2

save the improved result for each label, and then merge them?

Next question: is updating Prodigy straightforward? Could you tell me the big differences in the new version and the recommended way to update?

Last question, which is more of an NLP question:

I want to know your opinion about the "policy for correcting the annotations".

Imagine my annotator is faced with (I have three labels: ASTR, DATE, TIME):

           136,918, 

or

           8h 20m,

or

          20 March/February 1590

In each case, what should she choose as the correct label? I mean, should she choose, for example,

 136,918
or
 136,918, (with the trailing comma)?

Last question: do you have any suggestions for how I can proceed with semantic analysis based on my annotations after NER? Is it better to use spaCy, or do you have any other suggestion?

Thanks. I am using Prodigy 1.7.1, but I have access to the new version, 1.8.1.

If I update after correcting the annotations of all my data, can I still use db-merge on that data?

I mean, if I update Prodigy, will it work with data annotated in the previous version?

Hey Ines,

I ran ner.batch-train on my corrected annotations and here is the result:

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE     0.007
Correct    21
Incorrect  2780
Entities   2768
Unknown    0


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         2.293      26         48         46         0          0.351
02         1.070      31         39         47         0          0.443
03         0.733      34         38         52         0          0.472
04         0.809      36         45         63         0          0.444
05         0.600      35         44         60         0          0.443
06         0.460      36         43         61         0          0.456
07         0.541      37         46         66         0          0.446
08         0.418      38         45         67         0          0.458
09         0.394      32         45         55         0          0.416
10         0.475      34         44         58         0          0.436

Correct    34
Incorrect  38
Baseline   0.007
Accuracy   0.472

Model: C:\Users\moha\Documents\Prodigy\model_date_02
Training data: C:\Users\moha\Documents\Prodigy\model_date_02\training.jsonl
Evaluation data: C:\Users\moha\Documents\Prodigy\model_date_02\evaluation.jsonl

I got a bit confused. What actually is this model? In particular, what is this part about:

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE     0.007
Correct    21
Incorrect  2780
Entities   2768
Unknown    0

How can I use this result to get better annotations? I am familiar with DL and ML; could you please explain what you mean by accuracy here? Since we only have one label, "DATE", what do these results show?

As I mentioned, I am trying to correct the pre-annotated (by regex) data in Prodigy, label by label.

Now I am done correcting the pre-annotated data related to the label DATE. Then I ran ner.batch-train again (this time with 30 iterations) and got:

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 30


BEFORE     0.007
Correct    21
Incorrect  2780
Entities   2768
Unknown    0


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         2.293      26         48         46         0          0.351
02         1.070      31         39         47         0          0.443
03         0.733      34         38         52         0          0.472
04         0.809      36         45         63         0          0.444
05         0.600      35         44         60         0          0.443
06         0.460      36         43         61         0          0.456
07         0.541      37         46         66         0          0.446
08         0.418      38         45         67         0          0.458
09         0.394      32         45         55         0          0.416
10         0.475      34         44         58         0          0.436
11         0.297      32         50         60         0          0.390
12         0.363      35         42         58         0          0.455
13         0.338      34         40         54         0          0.459
14         0.332      35         42         58         0          0.455
15         0.352      35         47         63         0          0.427
16         0.376      36         44         62         0          0.450
17         0.529      37         42         62         0          0.468
18         0.496      37         44         64         0          0.457
19         0.286      37         44         64         0          0.457
20         0.212      34         47         61         0          0.420
21         0.231      34         44         58         0          0.436
22         0.461      31         49         57         0          0.387
23         0.392      31         45         53         0          0.408
24         0.353      32         41         51         0          0.438
25         0.330      31         49         57         0          0.387
26         0.205      32         48         58         0          0.400
27         0.266      35         46         62         0          0.432
28         0.383      32         52         62         0          0.381
29         0.204      32         48         58         0          0.400
30         0.273      30         46         52         0          0.395

Correct    34
Incorrect  38
Baseline   0.007
Accuracy   0.472

I do not understand why we have:

Correct    34
Incorrect  38

I am sure we have more "DATE" entities, at least 1500.

Is this step necessary at all?

Next question: imagine I have all my annotated data for the labels "ASTR" and "TIME".

If this step is not meaningful, do I only need to merge them?

What would be the next step? My aim is to have labeled data for all three entities (improved by the annotator and also by Prodigy), and then I want to do some semantic analysis on my corpus.

If you have three datasets where the same texts are annotated with one label each, you need to merge the examples so that all three labels are annotated on each example. You should be able to do this automatically, unless there are conflicts in the annotations, in which case you need to resolve those somehow. Prodigy v1.8 has some useful functions for this: the db-merge recipe is one, and the review interface might be useful as well, especially with a custom recipe.
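
If you prefer to do the merging yourself in Python, a rough, untested sketch could look like this. It assumes you've exported each dataset to JSONL (e.g. with db-out), that the three files contain the same "text" values, and it doesn't try to resolve conflicting or overlapping spans – you'd need to handle those yourself:

import json

# hypothetical file names – one JSONL export per label dataset
files = ["ner_date_01.jsonl", "ner_astr_01.jsonl", "ner_time_01.jsonl"]

merged = {}  # text -> example carrying the spans of all three labels
for path in files:
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            task = merged.setdefault(eg["text"], {"text": eg["text"], "spans": []})
            task["spans"].extend(eg.get("spans", []))

with open("merged.jsonl", "w", encoding="utf8") as f:
    for eg in merged.values():
        f.write(json.dumps(eg) + "\n")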

If you don’t have your texts annotated with all of your labels, it’s still possible to train a model from them, as Prodigy supports texts with missing annotations. This is why the accuracy metrics might be surprising: if you don’t use the --no-missing flag, the model may be making predictions which the annotations do not mark as correct or incorrect.

I know this doesn’t answer all of your questions, but counting back over your last messages, I see 15-20 different questions, depending on how I count them! Even at a few minutes per question, I’m sure you can see that it would take a long time to work through everything you’ve asked.

Since you’re on a research license, I hope that some of your colleagues will be able to help you with the more project-oriented questions about how to approach things like semantic analysis, how to refine your annotation scheme, etc.

Many thanks for your response. That's right, I asked a lot of questions, sorry about that. However, some of them are related to Prodigy.

Just regarding annotation, this recipe:

python -m prodigy ner.batch-train an_ner_astr_01 en_core_web_sm --output \model_astr_01 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing

what is it supposed to do?

I mean, I annotated the data by regex and the annotator corrected it (i.e. added some annotations or edited them in the Prodigy interface; as we understood, after each correction or edit she should always press the green accept button). For this label, which is an astronomical size (basically a 5 or 6 digit number), could you elaborate a bit on what ner.batch-train does in my case? I have this result:

(base) C:\Users\moha\Documents\Prodigy>python -m prodigy ner.batch-train an_ner_astr_01 en_core_web_sm --output \model_astr_01 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE     0.000
Correct    0
Incorrect  2919
Entities   2768
Unknown    0


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         96.338     108        60         125        0          0.643
02         95.158     134        32         149        0          0.807
03         95.202     141        22         153        0          0.865
04         94.795     139        21         148        0          0.869
05         95.464     139        21         148        0          0.869
06         93.525     138        21         146        0          0.868
07         93.129     138        24         149        0          0.852
08         94.345     138        25         150        0          0.847
09         94.080     138        24         149        0          0.852
10         95.058     139        24         151        0          0.853

Correct    139
Incorrect  21
Baseline   0.000
Accuracy   0.869

Model: C:\model_astr_01
Training data: C:\model_astr_01\training.jsonl
Evaluation data: C:\model_astr_01\evaluation.jsonl

ner.batch-train will call into spaCy and train a model with your data. This is done in a very similar way to the regular training with spaCy. It then shows you the training results.

Here, you have 6662 sentences and 20% of them are held back for evaluation. Before training, the model didn’t know anything, and after training, the accuracy is 0.8690 which is very promising. You’ve only trained for 10 iterations and the loss is going down. All of this would indicate that your model is able to learn the task.
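
Once training has finished, the directory you passed to --output is a loadable spaCy model, so you can try it on new text, roughly like this (path taken from your output above):

import spacy

nlp = spacy.load(r"C:\model_astr_01")  # the --output directory of ner.batch-train
doc = nlp("So also at 16° or 17° Aquarius with AB 1000 AC is 1375")
print([(ent.text, ent.label_) for ent in doc.ents])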

That's great! Many thanks for your prompt response. It is very nice of you to answer my questions; I am very excited now and happy to use Prodigy. Now I have these questions:

1. Can you interpret the "Correct" and "Incorrect" counts a bit?
2. It is great that I have a nice accuracy here. When the accuracy for some labels is around 0.70, how can I improve it in Prodigy? Only by adding iterations or changing the dropout? Is there no way to see the structure of the network and change it, or should we go back and annotate more?
3. Imagine I add all labels and each of them has an accuracy around 0.90. Can we say the model for the merged file also has the same accuracy?

Now I get the answer to my first question :). I guess the "Correct" count after training refers to the evaluation set, while the one before training covers all entities. Maybe it would be clearer if you separated the results for the training set and the test set. Many thanks.

I’m sorry but we really can’t provide this level of support, especially for a free research license. If your colleagues can’t help you, you might be best off looking for a paid consultant. You could post a request here: spaCy/prodigy consultants?

Thank you for your response, I see. I managed most of the steps with the help of your responses and my own effort. Now I have nice results for most of the labels, and if it becomes necessary, I will for sure do that.

Hey, I am trying to change the annotations of the COCO dataset, which is in .json format, for downsampled COCO images.
Can anybody help me?
Thank you

Hi! The docs on the image data format are a good place to start: Annotation interfaces · Prodigy · An annotation tool for AI, Machine Learning & NLP

Also see the documentation on using Prodigy for computer vision annotation: Computer Vision · Prodigy · An annotation tool for AI, Machine Learning & NLP
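
To give you an idea, Prodigy's image tasks are just JSON objects with an "image" and a list of "spans" with polygon "points", so a COCO conversion could look roughly like the untested sketch below. The file names are placeholders, and depending on your setup you may still need to serve the images or convert them to base64:

import json
from prodigy.util import write_jsonl

# hypothetical input: a COCO annotation file plus a local images/ folder
with open("instances.json", encoding="utf8") as f:
    coco = json.load(f)

categories = {cat["id"]: cat["name"] for cat in coco["categories"]}
images = {img["id"]: img for img in coco["images"]}

tasks = {}
for ann in coco["annotations"]:
    img = images[ann["image_id"]]
    task = tasks.setdefault(img["id"], {
        "image": "images/" + img["file_name"],
        "width": img["width"],
        "height": img["height"],
        "spans": [],
    })
    x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
    task["spans"].append({
        "label": categories[ann["category_id"]],
        "points": [[x, y], [x + w, y], [x + w, y + h], [x, y + h]],
    })

write_jsonl("coco_image_tasks.jsonl", list(tasks.values()))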

I just made a demo COCO display recipe that allows me to simply visualize the dataset and review it for correctness.

It might be a good starting point for your own workflow.

2 Likes

Awesome, thanks for sharing!