The text is split into several text chunks while using ner.make_gold

usage
ner

(lz-chen) #1

Hi, I have some NER tagging obtained from some model in JSON format. E.g.

{"text": "ServiceNow Inc. (NOW) PT Raised to $90.00 [Zolmax News]", "spans": [{"start": 0, "end": 17, "token_start": 0, "token_end": 2, "label": "ORG"}], "answer": "accept"}

I wanted to correct the annotations in prodigy and create a gold set. So I used ner.make_gold but after I exported the corrected annotation to a jsonl file, the text in one example was split into several examples. For instance, this is what I got:

{"text":"ServiceNow Inc. (NOW","spans":[{"token_start":0,"token_end":1,"start":0,"end":15,"text":"ServiceNow Inc.","label":"ORG","source":"en_core_web_sm","input_hash":66795106}],"answer":"accept","_input_hash":66795106,"_task_hash":-1916698682,"tokens":[{"text":"ServiceNow","start":0,"end":10,"id":0},{"text":"Inc.","start":11,"end":15,"id":1},{"text":"(","start":16,"end":17,"id":2},{"text":"NOW","start":17,"end":20,"id":3}]}
{"text":") PT Raised to $90.00 [Zolmax News]","spans":[],"answer":"accept","_input_hash":537309313,"_task_hash":-1431592704,"tokens":[{"text":")","start":0,"end":1,"id":0},{"text":"PT","start":2,"end":4,"id":1},{"text":"Raised","start":5,"end":11,"id":2},{"text":"to","start":12,"end":14,"id":3},{"text":"$","start":15,"end":16,"id":4},{"text":"90.00","start":16,"end":21,"id":5},{"text":"[","start":22,"end":23,"id":6},{"text":"Zolmax","start":23,"end":29,"id":7},{"text":"News","start":30,"end":34,"id":8},{"text":"]","start":34,"end":35,"id":9}]}

How can I keep the text as one example in this case?
Thanks a lot!


(Ines Montani) #2

Hi! By default, Prodigy will use spaCy to segment the incoming text into sentences. But you can disable this by setting --unsegmented on the command line. If you’re also using Prodigy to train a model, make sure to also set --unsegmented on ner.batch-train.


(lz-chen) #3

Hi Ines, thanks for the reply! But now that I have these segmented annotations stored in the database, is there an easy way to get the unsegmented annotation spans without re-annotating everything with --unsegmented?


(Ines Montani) #4

Do you know which examples belong together? If you do, I guess you could write a script that merges the examples and rewrites the character offsets. So, for each span in the second entry, you add the length of the previous text to the start and end offset, and so on. The example you posted is easy, because the second entry has no spans. So it can just become:

{
    "text": "ServiceNow Inc. (NOW) PT Raised to $90.00 [Zolmax News]",
    "spans": [{"start": 0, "end": 15, "label": "ORG"}]
}

It can potentially get a little messy, though, and it’s easy to make off-by-one errors. So you might have to do some manual correction.
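
A minimal sketch of that merge in Python, assuming you still have the original unsegmented text and know which segments belong to it, in order. The function name merge_segments is made up for illustration and is not part of Prodigy. Instead of adding up the lengths of the previous texts, it looks up each segment in the original text, which should also cover the case where whitespace between segments was dropped during segmentation. It only carries over start, end and label; token indices and hashes would need to be recomputed separately.

```python
def merge_segments(original_text, segments):
    """Merge segmented examples back into one example (hypothetical helper).

    original_text: the full, unsegmented text.
    segments: the exported examples that were split from it, in order.
    """
    merged_spans = []
    cursor = 0  # position in original_text where the next search starts
    for seg in segments:
        # Find where this segment begins in the original text.
        offset = original_text.find(seg["text"], cursor)
        if offset == -1:
            raise ValueError(f"Segment not found in original: {seg['text']!r}")
        for span in seg.get("spans", []):
            # Shift the character offsets so they are relative to the full text.
            merged_spans.append({
                "start": span["start"] + offset,
                "end": span["end"] + offset,
                "label": span["label"],
            })
        cursor = offset + len(seg["text"])
    return {"text": original_text, "spans": merged_spans}


original = "ServiceNow Inc. (NOW) PT Raised to $90.00 [Zolmax News]"
segments = [
    {"text": "ServiceNow Inc. (NOW",
     "spans": [{"start": 0, "end": 15, "label": "ORG"}]},
    {"text": ") PT Raised to $90.00 [Zolmax News]", "spans": []},
]
merged = merge_segments(original, segments)
print(merged)
```

It's worth checking a few merged examples by slicing the text with each span's start and end to confirm the offsets still point at the right entity.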


(lz-chen) #5

It's probably easier for me to just re-annotate, since I don’t have too many examples. Thank you very much for the help!