Review recipe for relationships not grouping same annotations

Hi,

I am reviewing a relationships dataset created with rel.manual and annotated by 3 annotators. I noticed that the review recipe doesn't group identical annotations when the relationships were annotated in a different order. For example, in the attached image, there are two visible entities: ATF2 and ATF3. There is a third one that is not visible. The entity ATF3 has relationships with the two other entities. All annotators annotated the same relationships between the same entities. However, one annotator annotated "Positive_Regulation" first, while the other 2 annotators annotated the other relationship first. This can be seen in ATF3 having the same two colors in both annotations, with the colors in a different order because the relationships were annotated in a different order.

How can I group annotated examples that have the same annotations regardless of the order in which they were made?

Thanks

Hi @ale,

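Great catch! This is happening because the data structure that stores the relations preserves the order in which they were annotated. The hashing function is sensitive to that order, so identical relations annotated in a different order get different hashes and are shown as a diff in the review view.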
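
To make the effect concrete, here is a minimal sketch of order-sensitive hashing. Note that toy_hash is a hypothetical stand-in built on hashlib and json, not Prodigy's actual hashing function; it only illustrates why two lists with the same relations in a different order end up with different hashes.

```python
import hashlib
import json


def toy_hash(relations):
    # Hypothetical stand-in for a task hash (NOT Prodigy's implementation):
    # a digest of the serialized list, which depends on element order.
    payload = json.dumps(relations, sort_keys=True)
    return hashlib.md5(payload.encode("utf8")).hexdigest()


regulation = {"head": 2, "child": 8, "label": "Positive_Regulation"}
binding = {"head": 26, "child": 8, "label": "Binding"}

first = [regulation, binding]   # "Positive_Regulation" annotated first
second = [binding, regulation]  # "Binding" annotated first

print(toy_hash(first) == toy_hash(second))  # False
```

The two lists describe the same relations, but because the serialized form differs, so does the digest.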

We have already fixed it and lined it up for the upcoming patch release early next week!

Hi @ale, just a heads up that it will take a bit longer to get the patch out. So I thought that in the meantime I could share the fix, which you can apply to your local copy of the review recipe.

Basically, we need to sort the list of relation annotations before computing the _task_hash. If you add the following snippet:

if eg[VIEW_ID_ATTR] == "relations":
    relations = eg.get("relations")
    if relations:
        sorted_relations = sorted(relations, key=lambda x: (x["head"], x["child"]))
        eg["relations"] = sorted_relations

before line 152 in your review.py, the diffs should render correctly.
Your local copy of the recipe is located at prodigy/recipes/review.py relative to your Prodigy installation path, which can be found by running prodigy stats.
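
If you'd like to check the sorting logic outside the recipe first, here is a minimal standalone sketch. The helper name sort_relations is hypothetical, and it assumes VIEW_ID_ATTR corresponds to the "_view_id" key that appears in exported examples; the dictionaries are trimmed to the fields the sort needs.

```python
import copy


def sort_relations(eg):
    """Return a copy of the example with its relations in a canonical order."""
    eg = copy.deepcopy(eg)  # leave the caller's example untouched
    relations = eg.get("relations")
    # Assumption: "_view_id" is the key that VIEW_ID_ATTR refers to.
    if eg.get("_view_id") == "relations" and relations:
        eg["relations"] = sorted(relations, key=lambda r: (r["head"], r["child"]))
    return eg


regulation = {"head": 2, "child": 8, "label": "Positive_Regulation"}
binding = {"head": 26, "child": 8, "label": "Binding"}

annotator_a = {"_view_id": "relations", "relations": [regulation, binding]}
annotator_b = {"_view_id": "relations", "relations": [binding, regulation]}

normalized_a = sort_relations(annotator_a)
normalized_b = sort_relations(annotator_b)
print(normalized_a["relations"] == normalized_b["relations"])  # True
```

Once both examples carry their relations in the same order, any hash computed over the relations afterwards no longer depends on the order of annotation.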

Thanks @magdaaniol!

I tried this solution but it doesn't seem to be working for me, perhaps I'm missing something. Can you help me understand why the fix isn't working?

The jsonl from the example above is:

{"text":"Overexpression of ATF2 resulted in significant increase in ATF3 promoter activity, and electrophoretic mobility shift assay identified this region as a core sequence to which ATF2 binds.","meta":{"sentence_uid":"20581861:8","source":"ExTRI"},"_input_hash":2007673963,"_task_hash":367958513,"_is_binary":false,"spans":[{"start":18,"end":22,"token_start":2,"token_end":2,"label":"GENETIC"},{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},{"start":175,"end":179,"token_start":26,"token_end":26,"label":"GENETIC"}],"tokens":[{"text":"Overexpression","start":0,"end":14,"id":0,"ws":true,"disabled":false},{"text":"of","start":15,"end":17,"id":1,"ws":true,"disabled":false},{"text":"ATF2","start":18,"end":22,"id":2,"ws":true,"disabled":false},{"text":"resulted","start":23,"end":31,"id":3,"ws":true,"disabled":false},{"text":"in","start":32,"end":34,"id":4,"ws":true,"disabled":false},{"text":"significant","start":35,"end":46,"id":5,"ws":true,"disabled":false},{"text":"increase","start":47,"end":55,"id":6,"ws":true,"disabled":false},{"text":"in","start":56,"end":58,"id":7,"ws":true,"disabled":false},{"text":"ATF3","start":59,"end":63,"id":8,"ws":true,"disabled":false},{"text":"promoter","start":64,"end":72,"id":9,"ws":true,"disabled":false},{"text":"activity","start":73,"end":81,"id":10,"ws":false,"disabled":false},{"text":",","start":81,"end":82,"id":11,"ws":true,"disabled":false},{"text":"and","start":83,"end":86,"id":12,"ws":true,"disabled":false},{"text":"electrophoretic","start":87,"end":102,"id":13,"ws":true,"disabled":false},{"text":"mobility","start":103,"end":111,"id":14,"ws":true,"disabled":false},{"text":"shift","start":112,"end":117,"id":15,"ws":true,"disabled":false},{"text":"assay","start":118,"end":123,"id":16,"ws":true,"disabled":false},{"text":"identified","start":124,"end":134,"id":17,"ws":true,"disabled":false},{"text":"this","start":135,"end":139,"id":18,"ws":true,"disabled":false},{"text":"region","start":140,"end":146,"id":19,"ws":
true,"disabled":false},{"text":"as","start":147,"end":149,"id":20,"ws":true,"disabled":false},{"text":"a","start":150,"end":151,"id":21,"ws":true,"disabled":false},{"text":"core","start":152,"end":156,"id":22,"ws":true,"disabled":false},{"text":"sequence","start":157,"end":165,"id":23,"ws":true,"disabled":false},{"text":"to","start":166,"end":168,"id":24,"ws":true,"disabled":false},{"text":"which","start":169,"end":174,"id":25,"ws":true,"disabled":false},{"text":"ATF2","start":175,"end":179,"id":26,"ws":true,"disabled":false},{"text":"binds","start":180,"end":185,"id":27,"ws":false,"disabled":false},{"text":".","start":185,"end":186,"id":28,"ws":false,"disabled":false}],"_view_id":"relations","relations":[{"head":2,"child":8,"head_span":{"start":18,"end":22,"token_start":2,"token_end":2,"label":"GENETIC"},"child_span":{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},"color":"#c5bdf4","label":"Positive_Regulation"},{"head":26,"child":8,"head_span":{"start":175,"end":179,"token_start":26,"token_end":26,"label":"GENETIC"},"child_span":{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},"color":"#c2f2f6","label":"Binding"}],"answer":"accept","_timestamp":1715203389,"_annotator_id":"test_db-sarah","_session_id":"test_db-sarah"}
{"text":"Overexpression of ATF2 resulted in significant increase in ATF3 promoter activity, and electrophoretic mobility shift assay identified this region as a core sequence to which ATF2 binds.","meta":{"sentence_uid":"20581861:8","source":"ExTRI"},"_input_hash":2007673963,"_task_hash":367958513,"_is_binary":false,"spans":[{"start":18,"end":22,"token_start":2,"token_end":2,"label":"GENETIC"},{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},{"start":175,"end":179,"token_start":26,"token_end":26,"label":"GENETIC"}],"tokens":[{"text":"Overexpression","start":0,"end":14,"id":0,"ws":true,"disabled":false},{"text":"of","start":15,"end":17,"id":1,"ws":true,"disabled":false},{"text":"ATF2","start":18,"end":22,"id":2,"ws":true,"disabled":false},{"text":"resulted","start":23,"end":31,"id":3,"ws":true,"disabled":false},{"text":"in","start":32,"end":34,"id":4,"ws":true,"disabled":false},{"text":"significant","start":35,"end":46,"id":5,"ws":true,"disabled":false},{"text":"increase","start":47,"end":55,"id":6,"ws":true,"disabled":false},{"text":"in","start":56,"end":58,"id":7,"ws":true,"disabled":false},{"text":"ATF3","start":59,"end":63,"id":8,"ws":true,"disabled":false},{"text":"promoter","start":64,"end":72,"id":9,"ws":true,"disabled":false},{"text":"activity","start":73,"end":81,"id":10,"ws":false,"disabled":false},{"text":",","start":81,"end":82,"id":11,"ws":true,"disabled":false},{"text":"and","start":83,"end":86,"id":12,"ws":true,"disabled":false},{"text":"electrophoretic","start":87,"end":102,"id":13,"ws":true,"disabled":false},{"text":"mobility","start":103,"end":111,"id":14,"ws":true,"disabled":false},{"text":"shift","start":112,"end":117,"id":15,"ws":true,"disabled":false},{"text":"assay","start":118,"end":123,"id":16,"ws":true,"disabled":false},{"text":"identified","start":124,"end":134,"id":17,"ws":true,"disabled":false},{"text":"this","start":135,"end":139,"id":18,"ws":true,"disabled":false},{"text":"region","start":140,"end":146,"id":19,"ws":
true,"disabled":false},{"text":"as","start":147,"end":149,"id":20,"ws":true,"disabled":false},{"text":"a","start":150,"end":151,"id":21,"ws":true,"disabled":false},{"text":"core","start":152,"end":156,"id":22,"ws":true,"disabled":false},{"text":"sequence","start":157,"end":165,"id":23,"ws":true,"disabled":false},{"text":"to","start":166,"end":168,"id":24,"ws":true,"disabled":false},{"text":"which","start":169,"end":174,"id":25,"ws":true,"disabled":false},{"text":"ATF2","start":175,"end":179,"id":26,"ws":true,"disabled":false},{"text":"binds","start":180,"end":185,"id":27,"ws":false,"disabled":false},{"text":".","start":185,"end":186,"id":28,"ws":false,"disabled":false}],"_view_id":"relations","relations":[{"head":2,"child":8,"head_span":{"start":18,"end":22,"token_start":2,"token_end":2,"label":"GENETIC"},"child_span":{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},"color":"#c5bdf4","label":"Positive_Regulation"},{"head":26,"child":8,"head_span":{"start":175,"end":179,"token_start":26,"token_end":26,"label":"GENETIC"},"child_span":{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},"color":"#c2f2f6","label":"Binding"}],"answer":"accept","_timestamp":1715203246,"_annotator_id":"test_db-joe","_session_id":"test_db-joe"}
{"text":"Overexpression of ATF2 resulted in significant increase in ATF3 promoter activity, and electrophoretic mobility shift assay identified this region as a core sequence to which ATF2 binds.","meta":{"sentence_uid":"20581861:8","source":"ExTRI"},"_input_hash":2007673963,"_task_hash":367958513,"_is_binary":false,"spans":[{"start":18,"end":22,"token_start":2,"token_end":2,"label":"GENETIC"},{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},{"start":175,"end":179,"token_start":26,"token_end":26,"label":"GENETIC"}],"tokens":[{"text":"Overexpression","start":0,"end":14,"id":0,"ws":true,"disabled":false},{"text":"of","start":15,"end":17,"id":1,"ws":true,"disabled":false},{"text":"ATF2","start":18,"end":22,"id":2,"ws":true,"disabled":false},{"text":"resulted","start":23,"end":31,"id":3,"ws":true,"disabled":false},{"text":"in","start":32,"end":34,"id":4,"ws":true,"disabled":false},{"text":"significant","start":35,"end":46,"id":5,"ws":true,"disabled":false},{"text":"increase","start":47,"end":55,"id":6,"ws":true,"disabled":false},{"text":"in","start":56,"end":58,"id":7,"ws":true,"disabled":false},{"text":"ATF3","start":59,"end":63,"id":8,"ws":true,"disabled":false},{"text":"promoter","start":64,"end":72,"id":9,"ws":true,"disabled":false},{"text":"activity","start":73,"end":81,"id":10,"ws":false,"disabled":false},{"text":",","start":81,"end":82,"id":11,"ws":true,"disabled":false},{"text":"and","start":83,"end":86,"id":12,"ws":true,"disabled":false},{"text":"electrophoretic","start":87,"end":102,"id":13,"ws":true,"disabled":false},{"text":"mobility","start":103,"end":111,"id":14,"ws":true,"disabled":false},{"text":"shift","start":112,"end":117,"id":15,"ws":true,"disabled":false},{"text":"assay","start":118,"end":123,"id":16,"ws":true,"disabled":false},{"text":"identified","start":124,"end":134,"id":17,"ws":true,"disabled":false},{"text":"this","start":135,"end":139,"id":18,"ws":true,"disabled":false},{"text":"region","start":140,"end":146,"id":19,"ws":
true,"disabled":false},{"text":"as","start":147,"end":149,"id":20,"ws":true,"disabled":false},{"text":"a","start":150,"end":151,"id":21,"ws":true,"disabled":false},{"text":"core","start":152,"end":156,"id":22,"ws":true,"disabled":false},{"text":"sequence","start":157,"end":165,"id":23,"ws":true,"disabled":false},{"text":"to","start":166,"end":168,"id":24,"ws":true,"disabled":false},{"text":"which","start":169,"end":174,"id":25,"ws":true,"disabled":false},{"text":"ATF2","start":175,"end":179,"id":26,"ws":true,"disabled":false},{"text":"binds","start":180,"end":185,"id":27,"ws":false,"disabled":false},{"text":".","start":185,"end":186,"id":28,"ws":false,"disabled":false}],"_view_id":"relations","relations":[{"head":26,"child":8,"head_span":{"start":175,"end":179,"token_start":26,"token_end":26,"label":"GENETIC"},"child_span":{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},"color":"#c2f2f6","label":"Binding"},{"head":2,"child":8,"head_span":{"start":18,"end":22,"token_start":2,"token_end":2,"label":"GENETIC"},"child_span":{"start":59,"end":63,"token_start":8,"token_end":8,"label":"GENETIC"},"color":"#c5bdf4","label":"Positive_Regulation"}],"answer":"accept","_timestamp":1715203358,"_annotator_id":"test_db-jane","_session_id":"test_db-jane"}

My review.py looks like this around line 152 with the fix:

for eg in examples:
    # Rehash example to make sure we're comparing correctly. In this
    # case, we want to consider "options" an input key and "accept" a
    # task key, so we can treat choice examples as by_input. We also
    # want to ignore the answer and key by it separately.

    # Fix

    if eg[VIEW_ID_ATTR] == "relations":
        relations = eg.get("relations")
        if relations:
            sorted_relations = sorted(relations, key=lambda x: x["child"])
            eg["relations"] = sorted_relations

    # End of fix

    eg = set_hashes(
        eg,
        overwrite=True,
        input_keys=INPUT_KEYS,
        task_keys=TASK_KEYS,
        ignore=IGNORE_HASH_KEYS,
    )

Thanks again.

Hi @ale,

This is my bad - apologies! We should actually be sorting first by head and then by child to cover all cases where the head or child can be the same token.
Could you modify the fix snippet like so (I have already updated my original answer):

if eg[VIEW_ID_ATTR] == "relations":
    relations = eg.get("relations")
    if relations:
        sorted_relations = sorted(relations, key=lambda x: (x["head"], x["child"])) # updated sort
        eg["relations"] = sorted_relations
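
To show why sorting by child alone wasn't enough for the example in this thread, here is a small sketch using only the head and child values from the posted JSONL (all other fields are omitted, and the helper names are just for illustration). Both relations share child token 8, and Python's sort is stable, so a child-only key leaves each list in its original order.

```python
# Relation order as saved by two of the annotators, trimmed to the sort keys.
order_a = [{"head": 2, "child": 8}, {"head": 26, "child": 8}]
order_b = [{"head": 26, "child": 8}, {"head": 2, "child": 8}]


def by_child(relations):
    # Ties on "child" keep their incoming order because sorted() is stable.
    return sorted(relations, key=lambda r: r["child"])


def by_head_and_child(relations):
    # The tuple key breaks ties on "head" first, then on "child".
    return sorted(relations, key=lambda r: (r["head"], r["child"]))


child_only_matches = by_child(order_a) == by_child(order_b)
head_child_matches = by_head_and_child(order_a) == by_head_and_child(order_b)
print(child_only_matches)  # False
print(head_child_matches)  # True
```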

Thanks! It is working perfectly.