How to have the same format after merging

Dear Ines,

Moin! As mentioned, I have made a refined named entity recognition model. I annotated entity by entity and then merged everything together. Before merging, the format of my data is:

{"text":"Therefore, as 100,000 is to 70,711, so is 7560 to 5346, the sine of the arc of 3\u00b0 4' 52\", which is CEB.","spans":[{"start":79,"end":88,"label":"LONG"}]}
{"text":"Subtracting this from 45\u00b0 leaves CBE, 41\u00b0 55' 8\", whose half is 20\u00b0 57' 34\", the tangent to which arc is 38,304.","spans":[{"start":38,"end":48,"label":"LONG"},{"start":64,"end":75,"label":"LONG"}]}

That is:

{"text":"text","spans":[{"start":79,"end":88,"label":"LABEL"}]}

But after merging I cannot keep the same format. I have something like this:

{"text":"Therefore, as 100,000 is to 70,711, so is 7560 to 5346, the sine of the arc of 3\u00b0 4' 52\", which is CEB.","spans":[{"start":14,"end":21,"label":"PARA","answer":"accept"},{"start":79,"end":88,"label":"LONG","answer":"accept"}],"_input_hash":16621573,"_task_hash":-706401107,"answer":"accept"}

which means I get extra fields after merging.

I feel that I should somehow run db-in, maybe before modifying the annotations ...

I have merged my data in this way:

from prodigy.components.db import connect
from prodigy.models.ner import merge_spans
from prodigy import set_hashes

db = connect()  # connect to the DB using the prodigy.json settings
datasets = ['ner_date_v02', 'ner_time_v02', 'ner_para_v05', 'ner_astr_v03',
            'ner_long_v10', 'ner_star_v02', 'ner_plan_v02', 'ner_name_v02',
            'ner_geom_v01']
examples = []
for dataset in datasets:
    examples += db.get_dataset(dataset)  # get examples from the database

# merge annotations on the same input into single examples
merged_examples = merge_spans(examples)
# re-hash so the merged examples get fresh input/task hashes
merged_examples = [set_hashes(eg, overwrite=True) for eg in merged_examples]
db.add_dataset('data_merged_v12')
db.add_examples(merged_examples, datasets=['data_merged_v12'])

The reason for this setup is that I want to calculate the NER metrics for each entity type based on this.

Hi! Sorry – I'm not sure I understand the question correctly! What's the output you expect after merging?


Thank you for your message. As you can see in the first example here:

text":"Therefore, as 100,000 is to 70,711, so is 7560 to 5346, the sine of the arc of 3\u00b0 4' 52\", which is CEB.","spans":[{"start":79,"end":88,"label":"LONG"}]}
{"text":"Subtracting this from 45\u00b0 leaves CBE, 41\u00b0 55' 8\", whose half is 20\u00b0 57' 34\", the tangent to which arc is 38,304.","spans":[{"start":38,"end":48,"label":"LONG"},{"start":64,"end":75,"label":"LONG"}]}

which follows the schema:

{"text":"text","spans":[{"start":79,"end":88,"label":"LABEL"}]}

But after merging I have extra fields like this:

"_input_hash":16621573,"_task_hash":-706401107,"answer":"accept"

I simply want to have output like this:

text":"text","spans":[{"start":79,"end":88,"label":"LABEL01"}],[{"start":79,"end":88,"label":"LABEL02"}]}

My aim is to follow this link

and calculate the other metrics of my NER model for each class.

Many thanks!

This is metadata that Prodigy always adds when you import examples to the database. The input and task hashes help identify annotations on the same input and the answer is used to determine whether the example should be used for updating the model. If you don't need them in your file, you can just write a script that deletes them.
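
For example, a minimal sketch of such a cleanup, assuming the key names shown in your examples above:

META_KEYS = ("_input_hash", "_task_hash", "answer")

def strip_meta(example):
    # drop Prodigy's bookkeeping keys from the example and its spans
    cleaned = {k: v for k, v in example.items() if k not in META_KEYS}
    cleaned["spans"] = [{k: v for k, v in span.items() if k != "answer"}
                        for span in example.get("spans", [])]
    return cleaned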

And just a note: the per-entity-type NER evaluation is now included in spaCy's Scorer directly (typically obtained by calling nlp.evaluate()), so you shouldn't need the alternate solution from Stack Overflow.
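
For instance, a rough sketch, assuming spaCy v2.2+ and evaluation data as (text, annotations) tuples (the model path is just a placeholder):

import spacy

nlp = spacy.load("./my_model")  # placeholder path for your trained model
examples = [
    ("As shown in Figure 2B, the Sun is assumed to be at the center of the planetary system.",
     {"entities": [(27, 30, "PLAN")]}),
]
scorer = nlp.evaluate(examples)
print(scorer.scores["ents_per_type"])  # precision/recall/F-score per label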


Thank you for your response. I have made a refined model using Prodigy 1.8.3 and spaCy 2.1.8.

I have tried to do it, but I guess it needs docs_golds.

Can you give me a hint how I can proceed and make the doc-gold pairs? Or do you have any other idea?

Thank you for the prompt response. I wrote this and it works!

import jsonlines
from prodigy.util import write_jsonl

path = '../data/DATA_MERGED_V12.jsonl'
a = []
with jsonlines.open(path) as reader:
    for example in reader:
        # drop the metadata Prodigy added on import
        del example["_input_hash"], example["_task_hash"], example["answer"]
        for span in example["spans"]:
            del span["answer"]
        a.append(example)
write_jsonl("DATE_MERGED_EDIT_V14.jsonl", a)

Now I need to use this as the examples in this script

and try to compute the metrics! However, I still need to convert the data from my format here:

{"text":"As shown in Figure 2B, the Sun is assumed to be at the center of the planetary system.","spans":[{"start":27,"end":30,"label":"PLAN"}]}

to this format:

("As shown in Figure 2B, the Sun is assumed to be at the center of the planetary system.",{"entities": [(27, 30, 'PLAN')]})

Am I moving in the correct direction? I guess if I do this, then I can make the gold data and use this link:

to evaluate the performance for each entity?

Basically, I have written up my question here:

https://stackoverflow.com/questions/58376213/calculate-all-the-metrics-of-a-custom-named-entity-recognition-nermodel-using

The last update of today is this one:

Basically, I managed to get something very similar:

c = []
for b in a:  # "a" holds the cleaned examples from the previous script
    d = {}
    d['text'] = b["text"]
    for span in b["spans"]:
        d.setdefault('entities', [])
        d['entities'].append((span["start"], span["end"], span["label"]))
    c.append(d)

The result is:

[{'text': 'To find the position of Mars at opposition, Kepler computed the angular distance that Mars and Earth—now substituting the place of the Sun—moved during 17 hours 20 minutes; Mars moved eastward about 16\' 20" and the Sun westward about 42\' 18".',
  'entities': [(152, 171, 'TIME'),
   (32, 42, 'ASTR'),
   (184, 192, 'ASTR'),
   (199, 206, 'LONG'),
   (234, 241, 'LONG'),
   (24, 28, 'PLAN'),
   (86, 90, 'PLAN'),
   (95, 100, 'PLAN'),
   (135, 138, 'PLAN'),
   (173, 177, 'PLAN'),
   (215, 218, 'PLAN')]},
 {'text': 'Accordingly, Kepler determined the longitude of Mars at opposition to be 198° 37\' 50" from which he subtracted about 39" in order to correct Mars\'s orbit; he got 198° 37\' 10" (18° 37\' 10" Libra).',
  'entities': [(35, 44, 'ASTR'),
   (56, 66, 'ASTR'),
   (148, 153, 'ASTR'),
   (73, 85, 'LONG'),
   (162, 174, 'LONG'),
   (176, 193, 'LONG'),
   (48, 52, 'PLAN'),
   (141, 145, 'PLAN')]},
 {'text': 'The Sun moved westward and its longitude decreased from the time of observation to its position opposite to Mars.',
  'entities': [(31, 40, 'ASTR'),
   (68, 79, 'ASTR'),
   (4, 7, 'PLAN'),
   (108, 112, 'PLAN')]},
 {'text': 'Therefore, the time of opposition is 17 hours 20 minutes before March 29, at 21:43, the time when the observation was made.',
  'entities': [(64, 72, 'DATE'),
   (37, 56, 'TIME'),
   (23, 33, 'ASTR'),
   (102, 113, 'ASTR')]}]

I only need to delete the 'text' key and make some other changes to get it exactly into this format: a tuple with two members, one string and one dictionary!

("As shown in Figure 2B, the Sun is assumed to be at the center of the planetary system.",{"entities": [(27, 30, 'PLAN')]})

Here is my last step:

Sorry for bothering you. I have written a script to convert my JSONL data produced by Prodigy to your example format, like this:

import jsonlines

path = '../data/DATA_MERGED_EDIT_V15.jsonl'
c = []
with jsonlines.open(path) as reader:
    for example in reader:
        # collect (start, end, label) tuples for this example's spans
        entities = [(span["start"], span["end"], span["label"])
                    for span in example["spans"]]
        # build the (text, {"entities": [...]}) tuple spaCy expects
        c.append((example["text"], {"entities": entities}))

I have almost the same format as yours, with a tiny difference:

[('On the distinction between the first motion and the second or proper motions; and in the proper motions, between the first and the second inequality.',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ('The testimony of the ages confirms that the motions of the planets are orbicular.',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ('Reason, having borrowed from experience, immediately presumes this: that their gyrations are perfect circles.',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ('For among figures it is circles, and among bodies the heavens, that are considered the most perfect.',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ('However, when experience seems to teach something different to those who pay careful attention, namely, that the planets deviate from a simple circular path, it gives rise to a powerful sense of wonder, which at length drives people to look into causes.',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ('It is just this from which astronomy arose among humans.',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ("Astronomy's aim is considered to be to show why the stars' motions appear to be irregular on earth, despite their being exceedingly well ordered in heaven, and to investigate the specific circles whereby the stars may be moved, so that by their aid the positions and appearances of those stars at any given time may be predicted.",
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ('Before the distinction between the first motion(1) and the second motions(2) was established, people noted (in contemplating the sun, moon and stars) that their diurnal paths were visually very nearly equivalent to circles.',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]}),
 ('These were, however, entwined one upon another like yarn on a ball, and the circles were for the most part smaller(3) circles of the sphere, rarely the greatest(4) (such',
  {'entities': [(14, 21, 'PARA'), (42, 46, 'PARA'), (79, 88, 'LONG')]})]

Do you think I can use this for the evaluation?

If not, can you give me a hint to convert it to the desired format, like this:

examples = [
    ("Trump says he's answered Mueller's Russia inquiry questions \u2013 live",{"entities":[[0,5,"PERSON"],[25,32,"PERSON"],[35,41,"GPE"]]}),
    ("Alexander Zverev reaches ATP Finals semis then reminds Lendl who is boss",{"entities":[[0,16,"PERSON"],[55,60,"PERSON"]]}),
    ("Britain's worst landlord to take nine years to pay off string of fines",{"entities":[[0,7,"GPE"]]}),
    ("Tom Watson: people's vote more likely given weakness of May's position",{"entities":[[0,10,"PERSON"],[56,59,"PERSON"]]}),
]

Hi Roberto,

I'm late to this thread so apologies if I've missed some of the context. But I think two functions should help you, one in Prodigy and another in spaCy.

The prodigy ner.gold-to-spacy command outputs the simple jsonl format used by spaCy's NER example scripts. If you have a dataset in Prodigy and want to produce the format from your most recent example, I think that command should be helpful.

Another option is to use prodigy db-out and get a jsonl in Prodigy's format, and then use the spacy convert --converter jsonl command. This will output spaCy's full json format, which will allow you to use the spacy train command.

As of v2.2, the spacy train command outputs full per-entity-type statistics in the accuracy.json file that's saved into each model directory during training. However, the current version of Prodigy only supports spaCy v2.1. We should have a new version of Prodigy out shortly that supports the new version. What you can do currently, though, is just make a new environment and install an up-to-date version of spaCy. This will let you train a model with spacy train, which is often the easiest way to do things.
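
For example, a rough sketch of that second option, assuming spaCy v2.2+ and using the dataset name from earlier in this thread (the output path is a placeholder):

python -m prodigy db-out data_merged_v12 > data_merged_v12.jsonl
python -m spacy convert data_merged_v12.jsonl ./corpus --converter jsonl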

Hope that helps!


Thank you for your response! As you saw, I had almost finished writing my own script for that :)

I actually used:

python -m prodigy ner.gold-to-spacy data_merged_v14 DATA_MERGED_GOLD_V14.jsonl

It worked perfectly fine and converted my Prodigy data to the gold format.

Now, how can I get all the metrics for all entities? I have tried this:

import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(nlp, examples, ent='TIME'):
    scorer = Scorer()
    for input_, annot in examples:
        # keep only the gold entities of the type we want to score
        text_entities = []
        for entity in annot.get('entities'):
            if ent in entity:
                text_entities.append(entity)
        doc_gold_text = nlp.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=text_entities)
        pred_value = nlp(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

It gives me the metrics per entity:

{'uas': 0.0, 'las': 0.0, 'ents_p': 2.012092377837248, 'ents_r': 97.1291866028708, 'ents_f': 3.9425131093416192, 'ents_per_type': {'ASTR': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'GEOM': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'NAME': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'PLAN': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'PARA': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'LONG': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'DATE': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'TIME': {'p': 98.06763285024155, 'r': 97.1291866028708, 'f': 97.59615384615384}, 'STAR': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'tags_acc': 0.0, 'token_acc': 100.0}

This doesn't seem correct (I guess keeping only one entity type in the gold annotations makes every prediction of the other types count as a false positive, hence the low overall precision). Then I tried this:

scorer = nlp.evaluate(example, verbose=False)
print(scorer.scores)

Then I got this error:

[E067] Invalid BILUO tag sequence: Got a tag starting with 'I' (inside an entity) without a preceding 'B' (beginning of an entity). Tag sequence:
['O', 'O', 'O', 'U-PARA', 'I-DATE']

Do you have any idea how I can get all the metric values for all entities using my gold data and model?

Last update: I have used this:

import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def Eval(examples):
    # test the saved model
    model_path = '../data/Model_U27'
    print("Loading from", model_path)
    ner_model = spacy.load(model_path)
    scorer = Scorer()
    for input_, annot in examples:
        try:
            doc_gold_text = ner_model.make_doc(input_)
            gold = GoldParse(doc_gold_text, entities=annot['entities'])
            pred_value = ner_model(input_)
            scorer.score(pred_value, gold)
        except Exception as e:
            # skip examples whose gold spans can't be aligned (e.g. overlaps)
            print(e)
    print(scorer.scores)

It almost works, and it gives me this result:

Loading from ./model6/
[E067] Invalid BILUO tag sequence: Got a tag starting with 'I' (inside an entity) without a preceding 'B' (beginning of an entity). Tag sequence:
['O', 'O', 'O', 'U-PARA', 'I-DATE']
{'uas': 0.0, 'las': 0.0, 'ents_p': 99.54797616601603, 'ents_r': 99.54797616601603, 'ents_f': 99.54797616601603, 'ents_per_type': {'ASTR': {'p': 99.80992608236537, 'r': 99.9154334038055, 'f': 99.86265187533017}, 'GEOM': {'p': 100.0, 'r': 99.85007496251875, 'f': 99.92498124531133}, 'NAME': {'p': 100.0, 'r': 99.25742574257426, 'f': 99.62732919254658}, 'PLAN': {'p': 99.88571428571429, 'r': 99.77168949771689, 'f': 99.82866933181039}, 'PARA': {'p': 98.51598173515981, 'r': 99.76878612716763, 'f': 99.13842619184378}, 'LONG': {'p': 99.7651203758074, 'r': 99.88242210464433, 'f': 99.82373678025851}, 'DATE': {'p': 99.5049504950495, 'r': 99.5049504950495, 'f': 99.5049504950495}, 'TIME': {'p': 97.97979797979798, 'r': 97.0, 'f': 97.48743718592964}, 'STAR': {'p': 84.61538461538461, 'r': 74.15730337078652, 'f': 79.04191616766467}}, 'tags_acc': 0.0, 'token_acc': 100.0}

As you see, the result for each entity looks great! But there is also an error.