Can the Span Categorizer import nested json lists?

grahama · November 19, 2023, 12:13pm

Hello, I need to import a batch of decomposed sentences into Prodigy for human correction before sending them off to a model. I’m in the middle of working out a pipeline for this: import an LLM’s output, correct it in Prodigy, then send the corrections/updates to a model.
Question: the Prodigy spancat format is flat, correct? My original format is nested, which I think is easier for me to digest.
I can write a script to flatten the list, as shown below. Or can Prodigy read nested lists, as the graphical interface seems to imply? I’m still very new to Prodigy; I bought the license last week.
Any help or push in the right direction is appreciated.

import json

# Nested JSON input
nested_json = [
    {
        "term": "cond",
        "text": "When X is between 7 and 12 inches",
        "pos": [0, 33],
        "details": {
            "exp": {
                "text": "X is between 7 and 12 inches",
                "pos": [5, 33],
                "details": {
                    "param": "X",
                    "value": [7, 12],
                    "unit": "inches"
                }
            }
        }
    },
    {
        "term": "subj",
        "text": "the system",
        "pos": [35, 44]
    },
    {
        "term": "pred",
        "text": "send a class Z type message to the mix bus",
        "pos": [46, 88],
        "details": {
            "act": {
                "text": "send",
                "pos": [46, 49]
            },
            "obj": {
                "text": "a class Z type message",
                "pos": [51, 72]
            },
            "dest": {
                "text": "to the mix bus",
                "pos": [74, 88]
            }
        }
    }
]

# Flatten the top-level items and their direct "details" children into spans
def extract_spans(nested_json):
    spans = []
    for item in nested_json:
        # Each top-level item becomes a span labelled with its upper-cased term
        term = item['term'].upper()
        spans.append({"start": item["pos"][0], "end": item["pos"][1], "label": term})
        # Only one level of "details" is visited; deeper nesting is ignored
        for key, value in item.get('details', {}).items():
            label = key.upper()
            spans.append({"start": value["pos"][0], "end": value["pos"][1], "label": label})
    return spans

# Extract spans
spans = extract_spans(nested_json)

# Create Prodigy compatible JSON
prodigy_json = {
    "text": "When X is between 7 and 12 inches, the system shall send a class Z type message to the mix bus",
    "spans": spans
}

# Output the Prodigy formatted JSON as a list of tasks
print(json.dumps([prodigy_json], indent=2))

Outputs:

[
  {
    "text": "When X is between 7 and 12 inches, the system shall send a class Z type message to the mix bus",
    "spans": [
      {
        "start": 0,
        "end": 33,
        "label": "COND"
      },
      {
        "start": 5,
        "end": 33,
        "label": "EXP"
      },
      {
        "start": 35,
        "end": 44,
        "label": "SUBJ"
      },
      {
        "start": 46,
        "end": 88,
        "label": "PRED"
      },
      {
        "start": 46,
        "end": 49,
        "label": "ACT"
      },
      {
        "start": 51,
        "end": 72,
        "label": "OBJ"
      },
      {
        "start": 74,
        "end": 88,
        "label": "DEST"
      }
    ]
  }
]
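
The script above only visits one level of "details". A minimal sketch of a recursive version follows, assuming spans use end-exclusive character offsets (so that text[start:end] reproduces the span text), which is an assumption about the target format rather than something confirmed in this thread. The names flatten_spans, _walk and to_task and the short example sentence are hypothetical. The sketch also reports spans whose offsets do not reproduce their own "text", which may be worth running on the data above: the full sentence contains "shall" while the "pred" text does not, and some "pos" pairs look end-inclusive while others look end-exclusive.

```python
import json


def flatten_spans(items):
    """Depth-first walk over nested items, yielding (label, text, start, end)."""
    for item in items:
        yield from _walk(item.get("term", ""), item)


def _walk(label, node):
    if not isinstance(node, dict):
        return  # plain values such as "param": "X" carry no offsets
    pos = node.get("pos")
    if isinstance(pos, (list, tuple)) and len(pos) == 2:
        yield label.upper(), node.get("text"), pos[0], pos[1]
    details = node.get("details")
    if isinstance(details, dict):
        for key, child in details.items():
            yield from _walk(key, child)


def to_task(text, items):
    """Build one flat task dict and list spans whose offsets miss their text."""
    spans, mismatches = [], []
    for label, span_text, start, end in flatten_spans(items):
        spans.append({"start": start, "end": end, "label": label})
        # Assumes end-exclusive offsets: text[start:end] should equal the span text
        if span_text is not None and text[start:end] != span_text:
            mismatches.append((label, span_text, text[start:end]))
    return {"text": text, "spans": spans}, mismatches


# Small hypothetical example; the "subj" offsets are off by one on purpose.
example_text = "When X is hot, the system shall stop"
example_items = [
    {"term": "cond", "text": "When X is hot", "pos": [0, 13],
     "details": {"exp": {"text": "X is hot", "pos": [5, 13],
                         "details": {"param": "X"}}}},
    {"term": "subj", "text": "the system", "pos": [15, 24]},
]

task, mismatches = to_task(example_text, example_items)
print(json.dumps(task))  # one task per line, as in a JSONL file
for label, expected, found in mismatches:
    print(f"{label}: expected {expected!r}, offsets give {found!r}")
```

Printing one json.dumps(task) per line gives a JSONL-style file with one task per sentence; the mismatch report is only a sanity check and does not change the spans.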
