I finally used your example data set and it works. I called it new_dataset; after annotation, where can I find that?
My main question: imagine I have pre-annotated data in pickle (JSON) format, how can I use Prodigy to improve the annotation?
The data you're annotating will be saved to a dataset in the Prodigy database. To export the annotations, you can use the db-out command:
prodigy db-out new_dataset > annotations.jsonl
This depends on what you want to do and what your goal is. Do you want to train a machine learning model? Do you want to correct labelled data? If you want to improve the existing annotations and correct them, you can convert them to Prodigy's format, load them in using a recipe like ner.manual and re-annotate them.
You might also want to check out the PRODIGY_README.html, which is available for download with Prodigy. It includes the detailed documentation and also an overview of the JSON format that Prodigy reads and creates.
We have data that is annotated by regex. We want to add some new labels to it and then use the newly annotated data for training a spaCy model.
I read some of your comments, but I still need to know which format I should provide to feed into your interface.
I currently have access to this format:
a pickle file of annotated text
Can you help me a bit with adding new labels to pre-annotated data and also with improving the annotation produced by the regex?
best
If you look at the “Annotation task formats” section in your PRODIGY_README.html, you'll find the exact JSON format that Prodigy expects for pre-annotated data for the different annotation types (NER, text classification etc.). The format should be pretty straightforward: for each example, you usually have a "text" and then either a "label" or "spans", depending on what you're annotating. You can then convert your pre-annotated data accordingly. For example, for named entity recognition, you'll need the text and the start/end character offsets and labels for the entities in that text.
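For example, a single NER example in the JSONL file could look roughly like this (the text and character offsets here are just for illustration):
{"text": "CD being dropped perpendicular to AB", "spans": [{"start": 0, "end": 2, "label": "GEOM"}, {"start": 34, "end": 36, "label": "GEOM"}]}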
Many thanks for your responses. As a test,
I loaded my raw data into Prodigy and added arbitrary labels (defined via --label in ner.manual). This shows I am able to annotate raw data with an arbitrary set of labels. Here I used a JSON file containing the sentences of my data, converted it to JSONL, and everything was OK.
Now back to my question: since I did special tokenization on my data with regex and also annotated the data with regex,
I have the annotated data in this format in Python:
[[("Therefore", "None"), ("CD", "GEOM"), ("being", "None"), ("dropped", "None"),
  ("perpendicular", "None"), ("to", "None"), ("AB", "GEOM"), ("where", "None"),
  ("AD", "GEOM"), ("which", "None"), ("is", "None"), ("half", "None"),
  ("AB", "GEOM"), ("is", "None"), ("1000", "NUM"), ("AC", "GEOM"),
  ("will", "None"), ("be", "None"), ("3333⅓", "NUM")],
 [("Looking", "None"), ("this", "None"), ("up", "None"), ("in", "None"),
  ("a", "None"), ("table", "None"), ("of", "None"), ("secants", "None"),
  ("we", "None"), ("find", "None"), ("the", "None"), ("angles", "None"),
  ("CAD", "GEOM"), ("and", "None"), ("CBD", "GEOM"), ("to", "None"),
  ("be", "None"), ("72° 33’", "COORD")],
 [("So", "None"), ("also", "None"), ("at", "None"), ("16°", "ANG"),
  ("or", "None"), ("17°", "ANG"), ("Aquarius", "None"), ("with", "None"),
  ("AB", "GEOM"), ("1000", "NUM"), ("AC", "GEOM"), ("is", "None"),
  ("1375", "NUM"), ("so", "None"), ("if", "None"), ("AD", "GEOM"),
  ("1000", "NUM"), ("AC", "GEOM"), ("is", "None"), ("2750", "NUM"),
  ("showing", "None"), ("68° 40’", "COORD"), ("in", "None"), ("the", "None"),
  ("table", "None"), ("of", "None"), ("secants", "None")]]
Do you have any suggestion for how I can proceed from here? Probably I should produce the same format that you mentioned. Is there any way I can use Prodigy for this?
Many thanks
Yes, this looks good – now you can write a small function that takes your tokens and outputs them as a dictionary with "text", "tokens" and "spans" keys. Do you still have the original text with whitespace? Otherwise, you'll have to reconstruct it by concatenating the token texts.
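Just to illustrate, a rough sketch of such a conversion could look like this (untested, and assuming the tokens were originally separated by single spaces; convert_sentence and annotated_data are just placeholder names):

def convert_sentence(token_label_pairs):
    # one sentence as a list of (token, label) tuples, e.g. [("CD", "GEOM"), ("being", "None"), ...]
    text = " ".join(token for token, _ in token_label_pairs)
    tokens = []
    spans = []
    offset = 0
    for i, (token, label) in enumerate(token_label_pairs):
        start, end = offset, offset + len(token)
        tokens.append({"text": token, "start": start, "end": end, "id": i})
        if label != "None":  # only labelled tokens become entity spans
            spans.append({"start": start, "end": end, "label": label,
                          "token_start": i, "token_end": i})
        offset = end + 1  # skip the single space we joined on
    return {"text": text, "tokens": tokens, "spans": spans}

examples = [convert_sentence(sentence) for sentence in annotated_data]

You can then write the examples out to a JSONL file with write_jsonl, just like in the regex script below.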
Everyone’s raw data is different, so there’s no converter that takes exactly what you have and outputs JSON. But Prodigy standardises on a pretty straightforward format, so hopefully it shouldn’t be too difficult to write a function that converts your annotations in Python or any other programming language you like.
If there’s something you can automate (for example, with regex), you definitely want to take advantage of that! The more you can automate or pre-select, the better. This saves you time and reduces the potential for human error.
Here’s a quick example of a conversion script in Python – I haven’t tested it, but something like this should work. You take a bunch of regular expressions, match them on all your texts, get the start and end character indices and format them as "spans" in Prodigy’s format. At the end, you can export the data to a file data.jsonl.
import re
from prodigy.util import write_jsonl

label = "LABEL"  # whatever label you want to use
texts = []  # a list of your texts
regex_patterns = [
    # your expressions – whatever you need
    re.compile(r"(?:[0-9a-fA-F]{2}[-:]){5}(?:[0-9a-fA-F]{2})")
]

examples = []
for text in texts:
    spans = []  # collect the matches of all patterns for this text
    for expression in regex_patterns:
        for match in re.finditer(expression, text):
            start, end = match.span()
            span = {"start": start, "end": end, "label": label}
            spans.append(span)
    task = {"text": text, "spans": spans}
    examples.append(task)

write_jsonl("data.jsonl", examples)
Now I solved it BY REGEX! :) Now,
I have three versions of the data, annotated with three different labels (they will be revised by the annotator on Thursday).
Before the annotator edits them, I used the following (is it correct?):
python -m prodigy db-in ner_date_01 NER_date_01.jsonl
python -m prodigy ner.batch-train NER_ASTR_01.jsonl en_core_web_sm --ASTR --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2
I got this kind of result:
BEFORE 0.500
Correct 10
Incorrect 10
Entities 2768
Unknown 2758
# LOSS RIGHT WRONG ENTS SKIP ACCURACY
01 11.611 18 2 1596 0 0.900
02 8.178 18 2 1456 0 0.900
03 5.282 19 1 889 0 0.950
04 4.834 18 2 1117 0 0.900
05 3.829 18 2 1047 0 0.900
06 3.615 20 0 947 0 1.000
Could you let me know if I am going in the right direction?
Should I do this after the edits by the annotator?
What is the next step? Should I merge the three datasets (with different labels) into one?
I am very excited now
Best
Nice to hear! And yes, this looks good! The results you're seeing right now (90%-100% accuracy) are a bit misleading – it's because you do not have any negative examples. Once you're done with correcting the annotations and have the full dataset you want to train on, you can add --no-missing to ner.batch-train. This will treat all unlabelled tokens as "outside of an entity" (instead of missing values) and give you more reliable results.
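For example, reusing the dataset and paths from your commands above, that could look something like:

python -m prodigy ner.batch-train ner_date_01 en_core_web_sm --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing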
Yes, exactly! If you're using the latest Prodigy v1.8, you can also use the db-merge command to merge datasets automatically.
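For example, something along these lines, with placeholder names for your three datasets and the merged output dataset:

python -m prodigy db-merge ner_date_01,ner_astr_01,ner_time_01 ner_all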
Good luck!
Dear Ines,
Thank you for your responses. So basically the annotator should correct each label, then send me the JSONL file, and then I should run
python -m prodigy db-in ner_date_01 NER_date_01.jsonl
python -m prodigy ner.batch-train NER_ASTR_01.jsonl en_core_web_sm --ASTR --output C:\Users\moha\Documents\Prodigy\model --n-iter 10 --eval-split 0.2 --dropout 0.2
save the improved result for each label, and then merge them?
Next question: is updating straightforward? Could you let me know the big differences in the new version and the recommended way to update?
Last question, more related to NLP: I want to know your idea about the "policy of correcting the annotations".
Imagine my annotator is faced with (I have three labels: ASTR, DATE, TIME)
136,918,
or
8h 20m,
or
20 March/February 1590
In each case, what should she choose as the correct label? I mean, should she choose for example
136,918
or
136,91, WITH COMMA
Last question: do you have any suggestion for how I can proceed with semantic analysis based on my annotation after NER? Is it better to use spaCy, or do you have any other suggestion?
Thanks. I am using Prodigy 1.7.1 but I have access to the new version, 1.8.1.
If I update after correcting the annotation of all my data, can I use db-merge on that data?
I mean, if I update Prodigy, is it OK with data annotated in the previous version?
Hey Ines,
I did ner.batch-train on my corrected annotations and here is the result:
Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2 Batch size: 16 Iterations: 10
BEFORE 0.007
Correct 21
Incorrect 2780
Entities 2768
Unknown 0
# LOSS RIGHT WRONG ENTS SKIP ACCURACY
01 2.293 26 48 46 0 0.351
02 1.070 31 39 47 0 0.443
03 0.733 34 38 52 0 0.472
04 0.809 36 45 63 0 0.444
05 0.600 35 44 60 0 0.443
06 0.460 36 43 61 0 0.456
07 0.541 37 46 66 0 0.446
08 0.418 38 45 67 0 0.458
09 0.394 32 45 55 0 0.416
10 0.475 34 44 58 0 0.436
Correct 34
Incorrect 38
Baseline 0.007
Accuracy 0.472
Model: C:\Users\moha\Documents\Prodigy\model_date_02
Training data: C:\Users\moha\Documents\Prodigy\model_date_02\training.jsonl
Evaluation data: C:\Users\moha\Documents\Prodigy\model_date_02\evaluation.jsonl
I got kind of confused. What actually is this model? Especially, what is this part about:
Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2 Batch size: 16 Iterations: 10
BEFORE 0.007
Correct 21
Incorrect 2780
Entities 2768
Unknown 0
How can I use the result to get better annotations? I am familiar with DL and ML, so could you please let me know what you mean here by accuracy? Since we only have one label, "DATE", what do these results show?
As I mentioned, I am trying to correct the pre-annotated (by regex) data with Prodigy, label by label.
Now I am done correcting the pre-annotated data for the label DATE. Then I ran the training and got:
Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2 Batch size: 16 Iterations: 30
BEFORE 0.007
Correct 21
Incorrect 2780
Entities 2768
Unknown 0
# LOSS RIGHT WRONG ENTS SKIP ACCURACY
01 2.293 26 48 46 0 0.351
02 1.070 31 39 47 0 0.443
03 0.733 34 38 52 0 0.472
04 0.809 36 45 63 0 0.444
05 0.600 35 44 60 0 0.443
06 0.460 36 43 61 0 0.456
07 0.541 37 46 66 0 0.446
08 0.418 38 45 67 0 0.458
09 0.394 32 45 55 0 0.416
10 0.475 34 44 58 0 0.436
11 0.297 32 50 60 0 0.390
12 0.363 35 42 58 0 0.455
13 0.338 34 40 54 0 0.459
14 0.332 35 42 58 0 0.455
15 0.352 35 47 63 0 0.427
16 0.376 36 44 62 0 0.450
17 0.529 37 42 62 0 0.468
18 0.496 37 44 64 0 0.457
19 0.286 37 44 64 0 0.457
20 0.212 34 47 61 0 0.420
21 0.231 34 44 58 0 0.436
22 0.461 31 49 57 0 0.387
23 0.392 31 45 53 0 0.408
24 0.353 32 41 51 0 0.438
25 0.330 31 49 57 0 0.387
26 0.205 32 48 58 0 0.400
27 0.266 35 46 62 0 0.432
28 0.383 32 52 62 0 0.381
29 0.204 32 48 58 0 0.400
30 0.273 30 46 52 0 0.395
Correct 34
Incorrect 38
Baseline 0.007
Accuracy 0.472
I do not understand why we have
Correct 34
Incorrect 38
I am sure we have more entities labelled "DATE", at least 1500.
Is this step necessary at all?
Next question: imagine I have all my annotated data for the labels "ASTR" and "TIME".
If this step is not meaningful, do I only need to merge them?
What would be the next step?
My aim is to have labelled data for all three entities (improved by the annotator and also by Prodigy), and then I want to do some semantic analysis on my corpus.
If you have three datasets where the same texts are annotated with one label each, you need to get the examples merged so that all three labels are annotated on each example. You should be able to do this automatically, unless there are conflicts in the annotations, in which case you need to resolve those somehow. Prodigy v1.8 has some useful functions for this: the db-merge recipe is one, and the review interface might be useful as well, especially with a custom recipe.
If you don't have your texts annotated with all of your labels, it's still possible to train a model from them, as Prodigy supports texts with missing annotations. This is why the accuracy metrics might be surprising: if you don't use the --no-missing flag, the model may be making predictions which the annotations do not mark as correct or incorrect.
I know this doesn’t answer all of your questions, but counting back over your last messages, I see 15-20 different questions, depending on how I count them! Even at a few minutes per question, I’m sure you can see that it would take a long time to work through everything you’ve asked.
Since you’re on a research license, I hope that some of your colleagues will be able to help you with the more project-oriented questions about how to approach things like semantic analysis, how to refine your annotation scheme, etc.
Many thanks for the response. That's right, I asked a lot of questions, sorry about that; however, some of them are related to Prodigy.
Just regarding annotation, this recipe:
python -m prodigy ner.batch-train an_ner_astr_01 en_core_web_sm --output \model_astr_01 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing
What is it supposed to do?
I mean, I annotated the data by regex, and the annotator corrected it (meaning she added some annotations or edited them in the Prodigy interface; as we understand, after each correction or… she should always press the green button). This label is astronomical size (basically a 5 or 6 digit number). Could you elaborate a bit on ner.batch-train in my case? I have this result:
(base) C:\Users\moha\Documents\Prodigy>python -m prodigy ner.batch-train an_ner_astr_01 en_core_web_sm --output \model_astr_01 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing
Loaded model en_core_web_sm
Using 20% of accept/reject examples (1330) for evaluation
Using 100% of remaining examples (5332) for training
Dropout: 0.2 Batch size: 16 Iterations: 10
BEFORE 0.000
Correct 0
Incorrect 2919
Entities 2768
Unknown 0
# LOSS RIGHT WRONG ENTS SKIP ACCURACY
01 96.338 108 60 125 0 0.643
02 95.158 134 32 149 0 0.807
03 95.202 141 22 153 0 0.865
04 94.795 139 21 148 0 0.869
05 95.464 139 21 148 0 0.869
06 93.525 138 21 146 0 0.868
07 93.129 138 24 149 0 0.852
08 94.345 138 25 150 0 0.847
09 94.080 138 24 149 0 0.852
10 95.058 139 24 151 0 0.853
Correct 139
Incorrect 21
Baseline 0.000
Accuracy 0.869
Model: C:\model_astr_01
Training data: C:\model_astr_01\training.jsonl
Evaluation data: C:\model_astr_01\evaluation.jsonl
ner.batch-train will call into spaCy and train a model with your data. This is done in a very similar way to the regular training with spaCy. It then shows you the training results.
Here, you have 6662 sentences and 20% of them are held back for evaluation. Before training, the model didn’t know anything, and after training, the accuracy is 0.8690 which is very promising. You’ve only trained for 10 iterations and the loss is going down. All of this would indicate that your model is able to learn the task.
That's great! Many thanks for your prompt response. I know it is very nice of you to answer my questions; I am very excited now and happy to use Prodigy. Now I have these questions:
1. Can you interpret the "correct" and "incorrect" a bit?
2. It is great that I have nice accuracy here. When the accuracy for some labels is around 0.70, how can I improve it in Prodigy? Only by adding iterations or changing the dropout? Is there no way to see the structure of the network and change it, or should we go back and do more annotation?
3. Imagine I add all the labels and each of them has an accuracy around 0.90. Can we say the model for the merged file also has the same accuracy?
Now I get the answer to my first question :) I guess the "correct" after using the model refers to the test set, and before, it is all entities. Maybe it would be clearer if you separated the results by training set and test set… many thanks
I’m sorry but we really can’t provide this level of support, especially for a free research license. If your colleagues can’t help you, you might be best off looking for a paid consultant. You could post a request here: spaCy/prodigy consultants?
Thank you for your response, I see. I managed most of the steps with your responses and my own effort. Now I have nice results for most of the labels, if it was necessary, for sure