The text is split to several text chunks while using ner.make_gold

lz-chen · March 12, 2019, 10:41am

Hi, I have some NER tagging obtained from some model in JSON format. E.g.

{"text": "ServiceNow Inc. (NOW) PT Raised to $90.00 [Zolmax News]", "spans": [{"start": 0, "end": 17, "token_start": 0, "token_end": 2, "label": "ORG"}], "answer": "accept"}

I wanted to correct the annotations in Prodigy and create a gold set, so I used ner.make_gold. But after I exported the corrected annotations to a JSONL file, the text of a single example had been split into several examples. For instance, this is what I got:

{"text":"ServiceNow Inc. (NOW","spans":[{"token_start":0,"token_end":1,"start":0,"end":15,"text":"ServiceNow Inc.","label":"ORG","source":"en_core_web_sm","input_hash":66795106}],"answer":"accept","_input_hash":66795106,"_task_hash":-1916698682,"tokens":[{"text":"ServiceNow","start":0,"end":10,"id":0},{"text":"Inc.","start":11,"end":15,"id":1},{"text":"(","start":16,"end":17,"id":2},{"text":"NOW","start":17,"end":20,"id":3}]}
{"text":") PT Raised to $90.00 [Zolmax News]","spans":[],"answer":"accept","_input_hash":537309313,"_task_hash":-1431592704,"tokens":[{"text":")","start":0,"end":1,"id":0},{"text":"PT","start":2,"end":4,"id":1},{"text":"Raised","start":5,"end":11,"id":2},{"text":"to","start":12,"end":14,"id":3},{"text":"$","start":15,"end":16,"id":4},{"text":"90.00","start":16,"end":21,"id":5},{"text":"[","start":22,"end":23,"id":6},{"text":"Zolmax","start":23,"end":29,"id":7},{"text":"News","start":30,"end":34,"id":8},{"text":"]","start":34,"end":35,"id":9}]}

How can I keep the text as one example in this case?
Thanks a lot!

ines · March 12, 2019, 10:48am

Hi! By default, Prodigy will use spaCy to segment the incoming text into sentences. But you can disable this by setting --unsegmented on the command line. If you’re also using Prodigy to train a model, make sure to also set --unsegmented on ner.batch-train.

lz-chen · March 12, 2019, 1:01pm

Hi Ines, thanks for the reply! But now that I have these segmented annotations stored in the database, is there an easy way for me to get the unsegmented annotation spans without re-annotating everything with --unsegmented?

ines · March 12, 2019, 1:13pm

Do you know which examples belong together? If you do, I guess you could write a script that merges the examples and rewrites the character offsets. So, for each span in the second entry, you add the length of the previous text to the start and end offset, and so on. The example you posted is easy, because the second entry has no spans. So it can just become:

{
    "text": "ServiceNow Inc. (NOW) PT Raised to $90.00 [Zolmax News]",
    "spans": [{"start": 0, "end": 15, "label": "ORG"}]
}

It can potentially get a little messy, though, and it’s easy to make off-by-one errors. So you might have to do some manual correction.
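
A minimal sketch of such a merge script, assuming you still have the original unsegmented text and know which exported examples belong to it, in order. Instead of adding up the lengths of the previous texts, it locates each segment inside the original text and shifts the spans by that position, which sidesteps the whitespace that may have been dropped between sentences. The function name merge_segments is hypothetical and not part of Prodigy, and the sketch does not rebuild token_start/token_end or the hashes.

```python
def merge_segments(original_text, segments):
    """Merge sentence-level examples back into one example.

    original_text: the full text before segmentation (assumed available).
    segments: the exported examples for that text, in their original order.
    """
    merged_spans = []
    cursor = 0  # search position in original_text, so repeated sentences match in order
    for segment in segments:
        offset = original_text.find(segment["text"], cursor)
        if offset == -1:
            raise ValueError(f"Segment not found in original text: {segment['text']!r}")
        for span in segment.get("spans", []):
            # Shift character offsets from segment-relative to text-relative.
            merged_spans.append({
                "start": span["start"] + offset,
                "end": span["end"] + offset,
                "label": span["label"],
            })
        cursor = offset + len(segment["text"])
    return {"text": original_text, "spans": merged_spans}


# Usage with the two examples from this thread (tokens and hashes omitted).
original = "ServiceNow Inc. (NOW) PT Raised to $90.00 [Zolmax News]"
segments = [
    {"text": "ServiceNow Inc. (NOW",
     "spans": [{"start": 0, "end": 15, "label": "ORG"}]},
    {"text": ") PT Raised to $90.00 [Zolmax News]", "spans": []},
]
merged = merge_segments(original, segments)
```

It is worth checking a few merged examples by slicing the text with each span's start and end to confirm the offsets still point at the intended entity.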

lz-chen · March 12, 2019, 1:40pm

Probably it is easier for me to just re-annotate since I don’t have too many examples. Thank you very much for the help!
