Datasets and using pre-annotated data

I finally used your example dataset and it works. I called the new dataset new_dataset. After annotation, where can I find it?

My main question: imagine I have pre-annotated data in pickle (or JSON) format. How can I use Prodigy to improve the annotation?

The data you're annotating will be saved to a dataset in the Prodigy database. To export the annotations, you can use the db-out command:

prodigy db-out new_dataset > annotations.jsonl

This depends on what you want to do and what your goal is. Do you want to train a machine learning model? Do you want to correct labelled data? If you want to improve the existing annotations and correct them, you can convert them to Prodigy's format and load them in with a recipe like ner.manual to re-annotate them.

You might also want to check out the PRODIGY_README.html, which is available for download with Prodigy. It includes the detailed documentation and also an overview of the JSON format that Prodigy reads and creates.

We have data that was annotated by regex. We want to add some new labels to it and then use the newly annotated data to train a spaCy model.

I read some of your comments, but I still need to know which format I should provide to feed into your interface.

I currently have access to this format:

a pickle file of annotated text

Can you help me a bit with adding new labels to pre-annotated data and also improving the annotations produced by regex?

Best

If you look at the “Annotation task formats” section in your PRODIGY_README.html, you’ll find the exact JSON format that Prodigy expects for pre-annotated data for the different annotation types (NER, text classification etc.). The format should be pretty straightforward: for each example, you usually have a "text" and then either a "label" or "spans", depending on what you’re annotating. You can then convert your pre-annotated data accordingly. For example, for named entity recognition, you’ll need the text and the start/end character offsets and labels for the entities in that text.
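For example, a single pre-annotated NER example in JSONL could look roughly like this (the text, offsets and labels below are made up purely for illustration – check the README for the exact details):

{"text": "AB is 1000", "spans": [{"start": 0, "end": 2, "label": "GEOM"}, {"start": 6, "end": 10, "label": "NUM"}]}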


Many thanks for your responses. As a test, I read my raw data into Prodigy and also added arbitrary labels (defined via --label in ner.manual) to it. This shows that I am able to annotate raw data with an arbitrary set of labels. Here I used a JSON file containing the sentences of my data, converted it to JSONL, and everything was OK.

Now back to my question: since I did special tokenization on my data with regex and also annotated the data with regex,
I have the annotated data in this format in Python:

[[('Therefore', 'None'),
('CD', 'GEOM'),
('being', 'None'),
('dropped', 'None'),
('perpendicular', 'None'),
('to', 'None'),
('AB', 'GEOM'),
('where', 'None'),
('AD', 'GEOM'),
('which', 'None'),
('is', 'None'),
('half', 'None'),
('AB', 'GEOM'),
('is', 'None'),
('1000', 'NUM'),
('AC', 'GEOM'),
('will', 'None'),
('be', 'None'),
('3333⅓', 'NUM')],
[('Looking', 'None'),
('this', 'None'),
('up', 'None'),
('in', 'None'),
('a', 'None'),
('table', 'None'),
('of', 'None'),
('secants', 'None'),
('we', 'None'),
('find', 'None'),
('the', 'None'),
('angles', 'None'),
('CAD', 'GEOM'),
('and', 'None'),
('CBD', 'GEOM'),
('to', 'None'),
('be', 'None'),
("72° 33'", 'COORD')],
[('So', 'None'),
('also', 'None'),
('at', 'None'),
('16°', 'ANG'),
('or', 'None'),
('17°', 'ANG'),
('Aquarius', 'None'),
('with', 'None'),
('AB', 'GEOM'),
('1000', 'NUM'),
('AC', 'GEOM'),
('is', 'None'),
('1375', 'NUM'),
('so', 'None'),
('if', 'None'),
('AD', 'GEOM'),
('1000', 'NUM'),
('AC', 'GEOM'),
('is', 'None'),
('2750', 'NUM'),
('showing', 'None'),
("68° 40'", 'COORD'),
('in', 'None'),
('the', 'None'),
('table', 'None'),
('of', 'None'),
('secants', 'None')]]

Do you have any suggestions for how I can proceed from here? Probably I should produce the same format you mentioned. Is there any way I can use Prodigy for this?

Many thanks

Yes, this looks good – now you can write a small function that takes your tokens and outputs them as a dictionary with a "text", "tokens" and "spans". Do you still have the original text with whitespace? Otherwise, you’ll have to reconstruct that by concatenating the token texts.

Everyone’s raw data is different, so there’s no converter that takes exactly what you have and outputs JSON. But Prodigy standardises on a pretty straightforward format, so hopefully it shouldn’t be too difficult to write a function that converts your annotations in Python or any other programming language you like.
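For example, a conversion for (token, label) tuples like yours could look roughly like this (an untested sketch: it assumes the string 'None' means "not an entity", that joining the tokens with single spaces is an acceptable reconstruction of the text, and the variable annotated_sents is just a placeholder for your data):

from prodigy.util import write_jsonl

# placeholder: a list of sentences, each a list of (token_text, label) tuples
annotated_sents = []

examples = []
for sent in annotated_sents:
    tokens = []
    spans = []
    offset = 0
    for i, (word, label) in enumerate(sent):
        start, end = offset, offset + len(word)
        # Prodigy-style token with character offsets into the reconstructed text
        tokens.append({"text": word, "start": start, "end": end, "id": i})
        if label != "None":  # 'None' here is the string your regex assigns
            spans.append({"start": start, "end": end,
                          "token_start": i, "token_end": i, "label": label})
        offset = end + 1  # assume a single space between tokens
    text = " ".join(word for word, _ in sent)
    examples.append({"text": text, "tokens": tokens, "spans": spans})

write_jsonl("converted.jsonl", examples)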

If there’s something you can automate (for example, with regex), you definitely want to take advantage of that! The more you can automate or pre-select, the better. This saves you time and reduces the potential for human error 🙂

Here’s a quick example for a conversion script in Python – I haven’t tested it yet but something like this should work. You take a bunch of regular expressions, match them on all your texts, get the start and end character index and format them as "spans" in Prodigy’s format. At the end, you can export the data to a file data.jsonl.

import re
from prodigy.util import write_jsonl

label = "LABEL"   # whatever label you want to use
texts = []  # a list of your texts
regex_patterns = [
    # your expressions – whatever you need
    re.compile(r"(?:[0-9a-fA-F]{2}[-:]){5}(?:[0-9a-fA-F]{2})")
]

examples = []
for text in texts:
    spans = []  # collect the matches of all patterns for this text
    for expression in regex_patterns:
        for match in expression.finditer(text):
            start, end = match.span()  # character offsets of the match
            span = {"start": start, "end": end, "label": label}
            spans.append(span)
    # one task per text, containing all matched spans
    task = {"text": text, "spans": spans}
    examples.append(task)

write_jsonl("data.jsonl", examples)

Now I solved it by regex! :) I have three versions of the data, annotated with three different labels (they will be revised by the annotator on Thursday).

Before editing by the annotator, I used this (is it correct?):

python -m prodigy db-in ner_date_01 NER_date_01.jsonl


python -m prodigy ner.batch-train NER_ASTR_01.jsonl en_core_web_sm --ASTR --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2

I got this kind of result:

BEFORE     0.500
Correct    10
Incorrect  10
Entities   2768
Unknown    2758


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         11.611     18         2          1596       0          0.900
02         8.178      18         2          1456       0          0.900
03         5.282      19         1          889        0          0.950
04         4.834      18         2          1117       0          0.900
05         3.829      18         2          1047       0          0.900
06         3.615      20         0          947        0          1.000

Could you let me know whether I am going in the right direction?

Should I do it again after the edits by the annotator?

What is the next step? Should I merge the three datasets (with different labels) into one?

I am very excited now 🙂

Best

Nice to hear! And yes, this looks good! The results you're seeing right now (90%-100% accuracy) are a bit misleading – it's because you do not have any negative examples. Once you're done with correcting the annotations and have the full dataset you want to train on, you can add --no-missing to ner.batch-train. This will treat all unlabelled tokens as "outside of an entity" (instead of missing values) and give you more reliable results.
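For example, something like this (the dataset name and output path are just placeholders, the other flags are the same ones you already used):

python -m prodigy ner.batch-train your_dataset en_core_web_sm --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing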

Yes, exactly! If you're using the latest Prodigy v1.8, you can also use the db-merge command to merge datasets automatically.
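For example, if your three datasets were called ner_date_01, ner_astr_01 and ner_time_01 (placeholder names, adjust them to your own), merging them into one dataset could look something like this:

python -m prodigy db-merge ner_date_01,ner_astr_01,ner_time_01 ner_all_labels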

Good luck!


Dear Ines,

Thank you for your responses. So basically, the annotator should correct each label, then she should send me the JSONL file, and then I should run

python -m prodigy db-in ner_date_01 NER_date_01.jsonl


python -m prodigy ner.batch-train NER_ASTR_01.jsonl en_core_web_sm --ASTR --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2

save the improved results for each label and then merge them?

Next question: is updating straightforward? Could you tell me the big differences in the new version and the recommended way to update?
Last question, which is related more to NLP:

I want to know your thoughts on the policy for correcting the annotations.

Imagine my annotator is faced with (I have three labels: ASTR, DATE, TIME)

           136,918, 

or

           8h 20m,

or

          20 March/February 1590

in each case, what should she choose as the correct label? I mean, should she choose, for example,

 136,918
or
 136,918, (with the comma)

Last question: do you have any suggestions for how I can proceed with semantic analysis based on my annotations after NER? Is it better to use spaCy, or do you have any other suggestions?

Thanks. I am using Prodigy 1.7.1, but I have access to the new version, 1.8.1.

If I update after correcting the annotations for all my data, can I use db-merge on that data?

I mean, if I update Prodigy, will it work with files annotated in the previous version?

Hey Ines,

I ran ner.batch-train on my corrected annotations and here is the result:

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE     0.007
Correct    21
Incorrect  2780
Entities   2768
Unknown    0


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         2.293      26         48         46         0          0.351
02         1.070      31         39         47         0          0.443
03         0.733      34         38         52         0          0.472
04         0.809      36         45         63         0          0.444
05         0.600      35         44         60         0          0.443
06         0.460      36         43         61         0          0.456
07         0.541      37         46         66         0          0.446
08         0.418      38         45         67         0          0.458
09         0.394      32         45         55         0          0.416
10         0.475      34         44         58         0          0.436

Correct    34
Incorrect  38
Baseline   0.007
Accuracy   0.472

Model: C:\Users\moha\Documents\Prodigy\model_date_02
Training data: C:\Users\moha\Documents\Prodigy\model_date_02\training.jsonl
Evaluation data: C:\Users\moha\Documents\Prodigy\model_date_02\evaluation.jsonl

I got a bit confused. What actually is this model? Especially, what is this part about:

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE     0.007
Correct    21
Incorrect  2780
Entities   2768
Unknown    0

How can I use the result to get better annotations? I am familiar with DL and ML. Could you please explain what you mean by accuracy here? Since we only have one label, "DATE", what do these results show?

As I mentioned, I am trying to correct the pre-annotated (by regex) data in Prodigy, label by label.

Now I am done with correcting the pre-annotated data for the label DATE. Then I ran the training and got:

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 30


BEFORE     0.007
Correct    21
Incorrect  2780
Entities   2768
Unknown    0


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         2.293      26         48         46         0          0.351
02         1.070      31         39         47         0          0.443
03         0.733      34         38         52         0          0.472
04         0.809      36         45         63         0          0.444
05         0.600      35         44         60         0          0.443
06         0.460      36         43         61         0          0.456
07         0.541      37         46         66         0          0.446
08         0.418      38         45         67         0          0.458
09         0.394      32         45         55         0          0.416
10         0.475      34         44         58         0          0.436
11         0.297      32         50         60         0          0.390
12         0.363      35         42         58         0          0.455
13         0.338      34         40         54         0          0.459
14         0.332      35         42         58         0          0.455
15         0.352      35         47         63         0          0.427
16         0.376      36         44         62         0          0.450
17         0.529      37         42         62         0          0.468
18         0.496      37         44         64         0          0.457
19         0.286      37         44         64         0          0.457
20         0.212      34         47         61         0          0.420
21         0.231      34         44         58         0          0.436
22         0.461      31         49         57         0          0.387
23         0.392      31         45         53         0          0.408
24         0.353      32         41         51         0          0.438
25         0.330      31         49         57         0          0.387
26         0.205      32         48         58         0          0.400
27         0.266      35         46         62         0          0.432
28         0.383      32         52         62         0          0.381
29         0.204      32         48         58         0          0.400
30         0.273      30         46         52         0          0.395

Correct    34
Incorrect  38
Baseline   0.007
Accuracy   0.472

I do not understand why we have:

Correct    34
Incorrect  38

I am sure we have more entities labelled "DATE", at least 1500.

Is this step necessary at all?

Next question: imagine I have all my annotated data for the labels "ASTR" and "TIME".

If this step is not meaningful, do I only need to merge them?

What would be the next step?
My aim is to have labelled data for all three entity types (improved by the annotator and also by Prodigy), and then I want to do some semantic analysis on my corpus.

If you have three datasets where the same texts are annotated with one label each, you need to merge the examples so that all three labels are annotated on each example. You should be able to do this automatically, unless there are conflicts in the annotations, in which case you need to resolve those somehow. Prodigy v1.8 has some useful functions for this: the db-merge recipe is one, and the review interface might be useful as well, especially with a custom recipe.

If you don’t have your texts annotated with all of your labels, it’s still possible to train a model from them, as Prodigy supports texts with missing annotations. This is why the accuracy metrics might be surprising: if you don’t use the --no-missing flag, the model may be making predictions which the annotations do not mark as correct or incorrect.

I know this doesn’t answer all of your questions, but counting back over your last messages, I see 15-20 different questions, depending on how I count them! Even at a few minutes per question, I’m sure you can see that it would take a long time to work through everything you’ve asked.

Since you’re on a research license, I hope that some of your colleagues will be able to help you with the more project-oriented questions about how to approach things like semantic analysis, how to refine your annotation scheme, etc.

Many thanks for the response. That's right, I asked a lot of questions, sorry about that; however, some of them are related to Prodigy.

Just regarding annotation, this recipe:

python -m prodigy ner.batch-train an_ner_astr_01 en_core_web_sm --output \model_astr_01 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing

what is it supposed to do?

I mean, I annotated the data by regex and the annotator corrected it (meaning she added some annotations or edited them in the Prodigy interface; as we understand it, after each correction or… she should always press the green accept button). For this label, which is an astronomical size (basically a 5 or 6 digit number), could you elaborate a bit on what ner.batch-train does in my case? I have this result:

(base) C:\Users\moha\Documents\Prodigy>python -m prodigy ner.batch-train an_ner_astr_01 en_core_web_sm --output \model_astr_01 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing

Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2  Batch size: 16  Iterations: 10


BEFORE     0.000
Correct    0
Incorrect  2919
Entities   2768
Unknown    0


#          LOSS       RIGHT      WRONG      ENTS       SKIP       ACCURACY
01         96.338     108        60         125        0          0.643
02         95.158     134        32         149        0          0.807
03         95.202     141        22         153        0          0.865
04         94.795     139        21         148        0          0.869
05         95.464     139        21         148        0          0.869
06         93.525     138        21         146        0          0.868
07         93.129     138        24         149        0          0.852
08         94.345     138        25         150        0          0.847
09         94.080     138        24         149        0          0.852
10         95.058     139        24         151        0          0.853

Correct    139
Incorrect  21
Baseline   0.000
Accuracy   0.869

Model: C:\model_astr_01
Training data: C:\model_astr_01\training.jsonl
Evaluation data: C:\model_astr_01\evaluation.jsonl

ner.batch-train will call into spaCy and train a model with your data. This is done in a very similar way to the regular training with spaCy. It then shows you the training results.

Here, you have 6,662 sentences and 20% of them are held back for evaluation. Before training, the model didn't know anything, and after training, the accuracy is 0.869, which is very promising. You've only trained for 10 iterations and the loss is going down. All of this would indicate that your model is able to learn the task.
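Since the model written to the --output directory is a regular spaCy model, you can also load it and try it out directly, for example like this (the path is the one from your command, the example sentence is made up):

import spacy

nlp = spacy.load(r"C:\model_astr_01")  # the directory passed to --output
doc = nlp("AB is 1000 and AC is 1375")
print([(ent.text, ent.label_) for ent in doc.ents])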

That's great! Many thanks for your prompt response. It is very nice of you to answer my questions. I am very excited now and happy to be using Prodigy. Now I have these questions:

1- Can you interpret the "correct" and "incorrect" counts a bit?
2- It is great that I have a nice accuracy here. When the accuracy for some labels is around 0.70, how can I improve it in Prodigy? Only by adding iterations or changing the dropout? Is there no way to see the structure of the network and change it, or should we go back and annotate more?
3- Imagine I add all labels and each of them has an accuracy around 0.90. Can we say the model trained on the merged file also has the same accuracy?

Now I get the answer to my first question :). I guess the "correct" count after training refers to the test set, and before training it covers all entities. Maybe it would be clearer if you separated the results into training set and test set… Many thanks.

I’m sorry but we really can’t provide this level of support, especially for a free research license. If your colleagues can’t help you, you might be best off looking for a paid consultant. You could post a request here: spaCy/prodigy consultants?

Thank you for your response, I see. I managed most of the steps with the help of your responses and my own effort. Now I have nice results for most of the labels. If it becomes necessary, I will do that, for sure.