Exporting highlighted text from ner.manual for use in PhraseMatcher

Hi,

I'm attempting to use Prodigy to quickly come up with patterns for later use with a spaCy PhraseMatcher with attr="SHAPE".

I labeled a couple hundred examples, hoping the resulting highlights would be stored in a retrievable way. It appears they are, but I haven't been able to get them back out yet.

To reproduce:

I ran the following command: prodigy ner.manual test_animal_fruit_example en_core_web_sm ./data/test.jsonl --label "Fruit Weight, Animal Weight", where test.jsonl contains:

{"text": "Have you ever seen a 5kg apple in a tree with a cat"}
{"text": "A whole bunch of text is going on here about a apple of 5oz size in a tree with a cat"}
{"text": "Dancing horses and a 30lb mango in a tree with a cat"}
{"text": "We ate bananas so that the 500 kilo elephant would stay pleased"}
{"text": "Don't feed your cat too much or it will end up as a 60 pound kitty"}

I highlighted 5kg apple, apple of 5oz, 30lb mango, 500 kilo elephant, and 60 pound kitty with their respective labels (the first three are Fruit Weight, the final two Animal Weight).

The resulting annotations file, after using db-out, is like so:

{"text":"Have you ever seen a 5kg apple in a tree with a cat","_input_hash":1015986919,"_task_hash":2059168639,"tokens":[{"text":"Have","start":0,"end":4,"id":0},{"text":"you","start":5,"end":8,"id":1},{"text":"ever","start":9,"end":13,"id":2},{"text":"seen","start":14,"end":18,"id":3},{"text":"a","start":19,"end":20,"id":4},{"text":"5","start":21,"end":22,"id":5},{"text":"kg","start":22,"end":24,"id":6},{"text":"apple","start":25,"end":30,"id":7},{"text":"in","start":31,"end":33,"id":8},{"text":"a","start":34,"end":35,"id":9},{"text":"tree","start":36,"end":40,"id":10},{"text":"with","start":41,"end":45,"id":11},{"text":"a","start":46,"end":47,"id":12},{"text":"cat","start":48,"end":51,"id":13}],"_session_id":"test_animal_fruit_example-default","_view_id":"ner_manual","spans":[{"start":21,"end":30,"token_start":5,"token_end":7,"label":"Fruit Weight"}],"answer":"accept"}
{"text":"A whole bunch of text is going on here about a apple of 5oz size in a tree with a cat","_input_hash":-1205931032,"_task_hash":552599497,"tokens":[{"text":"A","start":0,"end":1,"id":0},{"text":"whole","start":2,"end":7,"id":1},{"text":"bunch","start":8,"end":13,"id":2},{"text":"of","start":14,"end":16,"id":3},{"text":"text","start":17,"end":21,"id":4},{"text":"is","start":22,"end":24,"id":5},{"text":"going","start":25,"end":30,"id":6},{"text":"on","start":31,"end":33,"id":7},{"text":"here","start":34,"end":38,"id":8},{"text":"about","start":39,"end":44,"id":9},{"text":"a","start":45,"end":46,"id":10},{"text":"apple","start":47,"end":52,"id":11},{"text":"of","start":53,"end":55,"id":12},{"text":"5","start":56,"end":57,"id":13},{"text":"oz","start":57,"end":59,"id":14},{"text":"size","start":60,"end":64,"id":15},{"text":"in","start":65,"end":67,"id":16},{"text":"a","start":68,"end":69,"id":17},{"text":"tree","start":70,"end":74,"id":18},{"text":"with","start":75,"end":79,"id":19},{"text":"a","start":80,"end":81,"id":20},{"text":"cat","start":82,"end":85,"id":21}],"_session_id":"test_animal_fruit_example-default","_view_id":"ner_manual","spans":[{"start":47,"end":59,"token_start":11,"token_end":14,"label":"Fruit Weight"}],"answer":"accept"}
{"text":"Dancing horses and a 30lb mango in a tree with a cat","_input_hash":-843711838,"_task_hash":903131301,"tokens":[{"text":"Dancing","start":0,"end":7,"id":0},{"text":"horses","start":8,"end":14,"id":1},{"text":"and","start":15,"end":18,"id":2},{"text":"a","start":19,"end":20,"id":3},{"text":"30","start":21,"end":23,"id":4},{"text":"lb","start":23,"end":25,"id":5},{"text":"mango","start":26,"end":31,"id":6},{"text":"in","start":32,"end":34,"id":7},{"text":"a","start":35,"end":36,"id":8},{"text":"tree","start":37,"end":41,"id":9},{"text":"with","start":42,"end":46,"id":10},{"text":"a","start":47,"end":48,"id":11},{"text":"cat","start":49,"end":52,"id":12}],"_session_id":"test_animal_fruit_example-default","_view_id":"ner_manual","spans":[{"start":21,"end":31,"token_start":4,"token_end":6,"label":"Fruit Weight"}],"answer":"accept"}
{"text":"We ate bananas so that the 500 kilo elephant would stay pleased","_input_hash":-263246995,"_task_hash":1265699488,"tokens":[{"text":"We","start":0,"end":2,"id":0},{"text":"ate","start":3,"end":6,"id":1},{"text":"bananas","start":7,"end":14,"id":2},{"text":"so","start":15,"end":17,"id":3},{"text":"that","start":18,"end":22,"id":4},{"text":"the","start":23,"end":26,"id":5},{"text":"500","start":27,"end":30,"id":6},{"text":"kilo","start":31,"end":35,"id":7},{"text":"elephant","start":36,"end":44,"id":8},{"text":"would","start":45,"end":50,"id":9},{"text":"stay","start":51,"end":55,"id":10},{"text":"pleased","start":56,"end":63,"id":11}],"_session_id":"test_animal_fruit_example-default","_view_id":"ner_manual","spans":[{"start":27,"end":44,"token_start":6,"token_end":8,"label":"Animal Weight"}],"answer":"accept"}
{"text":"Don't feed your cat too much or it will end up as a 60 pound kitty","_input_hash":1571593311,"_task_hash":-532519224,"tokens":[{"text":"Do","start":0,"end":2,"id":0},{"text":"n't","start":2,"end":5,"id":1},{"text":"feed","start":6,"end":10,"id":2},{"text":"your","start":11,"end":15,"id":3},{"text":"cat","start":16,"end":19,"id":4},{"text":"too","start":20,"end":23,"id":5},{"text":"much","start":24,"end":28,"id":6},{"text":"or","start":29,"end":31,"id":7},{"text":"it","start":32,"end":34,"id":8},{"text":"will","start":35,"end":39,"id":9},{"text":"end","start":40,"end":43,"id":10},{"text":"up","start":44,"end":46,"id":11},{"text":"as","start":47,"end":49,"id":12},{"text":"a","start":50,"end":51,"id":13},{"text":"60","start":52,"end":54,"id":14},{"text":"pound","start":55,"end":60,"id":15},{"text":"kitty","start":61,"end":66,"id":16}],"_session_id":"test_animal_fruit_example-default","_view_id":"ner_manual","spans":[{"start":52,"end":66,"token_start":14,"token_end":16,"label":"Animal Weight"}],"answer":"accept"}

Now I'm trying to figure out how to get just the text I highlighted, along with its associated label, back out. This is proving difficult. I have something like this:

for annotation in data:
    if annotation['answer'] == 'accept':
        doc = nlp(annotation['text'])
        try:
            if len(annotation['spans']) > 0:
                highlighted_span_start = int(annotation['spans'][0]['start'])
                highlighted_span_end = int(annotation['spans'][0]['end'])
                if annotation['spans'][0]['label'] == 'Fruit Weight':
                    print('Fruit Weight', doc[highlighted_span_start:highlighted_span_end])
                    print('*' * 30)
                elif annotation['spans'][0]['label'] == 'Animal Weight':
                    print('Animal Weight', doc[highlighted_span_start:highlighted_span_end])
                    print('*' * 30)
        except KeyError:
            pass

This does not work, and using token_start and token_end as the slice indices does not work either.

I assume there's a simple method here for getting the highlighted strings back out from ner.manual alongside their associated labels, but I haven't figured it out.

Thanks for any assistance.

Hi! I think you might be slightly overthinking this and it should be much easier :slightly_smiling_face: Ultimately, what you need is the highlighted text and the label, right? The span's start and end are character offsets into the text, so you typically don't need spaCy at all for the extraction; you can just slice the text string (text[start:end]). Basically, like this:

for annotation in data: 
    if annotation["answer"] == "accept":
        text = annotation["text"]
        for span in annotation.get("spans", []):
            print(text[span["start"]:span["end"]], span["label"])

To convert them into patterns, you can call nlp.make_doc on the sliced text. (There's no need to run the full nlp object with all pipeline components, since the shape is a lexical attribute that all tokens have by default – it doesn't need a model's predictions.)
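If it helps, here's the whole round trip in one self-contained sketch. The word_shape helper is only a rough approximation of spaCy's token.shape_ (an assumption for illustration; spaCy's real implementation handles more cases), so in practice you'd call nlp.make_doc on the highlighted text and read token.shape_ off each token instead:

```python
import itertools

def word_shape(word, max_run=4):
    """Rough approximation of spaCy's token.shape_ (illustrative only):
    digits -> 'd', lowercase -> 'x', uppercase -> 'X', other characters
    kept as-is, with runs of the same shape character capped at max_run."""
    mapped = [
        "d" if ch.isdigit() else "x" if ch.islower() else "X" if ch.isupper() else ch
        for ch in word
    ]
    shape = []
    for char, run in itertools.groupby(mapped):
        shape.append(char * min(len(list(run)), max_run))
    return "".join(shape)

# One annotation as produced by db-out, trimmed to the relevant keys
annotation = {
    "text": "Dancing horses and a 30lb mango in a tree with a cat",
    "answer": "accept",
    "spans": [{"start": 21, "end": 31, "label": "Fruit Weight"}],
}

span = annotation["spans"][0]
# Character offsets index straight into the text string
highlighted = annotation["text"][span["start"]:span["end"]]   # "30lb mango"
shapes = [word_shape(word) for word in highlighted.split()]   # ["ddxx", "xxxx"]
print(span["label"], highlighted, shapes)
```

The shape strings then map directly onto what a PhraseMatcher with attr="SHAPE" compares against: adding nlp.make_doc("30lb mango") as a pattern will match any span whose tokens share those shapes, e.g. "20kg melon".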


Not sure why I didn't ask myself if the start and end were plain ol' string indices. Thanks so much for your help!