First Project Data won't load to prodigy

GitMatt-design · August 11, 2023, 5:17am

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.util import split_spans

@prodigy.recipe("custom_ner_recipe",
                dataset=("Menus", "positional", None, str),
                label1=("menu_item", "positional", None, str),
                label2=("price", "positional", None, str),
                span_label=("Menu_group", "positional", None, str))

def custom_ner_annotation(dataset, Menu_Item, price, menu_group):
    def add_tokens_to_spans(examples):
        for eg in examples:
            text = eg["text"]  # Assuming your data has a "text" field
            menu_items = eg.get("MENU_ITEM", ["Meat Lovers"])
            prices = eg.get("PRICE", ["1.99"])  # Assuming you have a PRICE field
            menu_groups = eg.get("Menu_group", ["Extra Large Pizza"])  # Assuming you have a Menu_group field

            spans = []
            for menu_item in menu_items:
                spans.append({
                    "start": text.index(menu_item),
                    "end": text.index(menu_item) + len(menu_item),
                    "label": menu_item  # Replace with the appropriate label for menu items
                })

            for price in prices:
                spans.append({
                    "start": text.index(price),
                    "end": text.index(price) + len(price),
                    "label": price  # Replace with the appropriate label for prices
                })

            for menu_group in menu_groups:
                spans.append({
                    "start": text.index(menu_group),
                    "end": text.index(menu_group) + len(menu_group),
                    "label": menu_group  # Replace with the appropriate label for menu groups
                })

            eg["spans"] = spans
            yield eg

    stream = JSONL("C:\Users\matt\Downloads\json-fixer (3).jsonl")  # Corrected path

    components = [
        add_tokens_to_spans,
    "ner_manual",  # Built-in NER annotation UI
    {
        "label": menu_group,  # Use span_label instead of menu_group
        "pattern": [{"label": menu_group}],  # Corrected pattern label
        "on_exit": prodigy.set_hashes,
    },
]

    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {
            "labels": [Menu_Item, price,menu_group],
            "exclude_by": input,
        },
        "update": None,
        "before_db": None,
        "after_db": None,
        "on_exit": None,
        "config_auto": False,
        "progress": None,
        "total": None,
        "get_session_id": None,
        "components": components,
    }

The first few lines of the jsonl file look like this

[
{"restaurant_id": "restaurant_id", "category": "category", "name": "name", "description": "description", "price": "price"},
{"restaurant_id": "1", "category": "Extra Large Pizza", "name": "Extra Large Meat Lovers", "description": "Whole pie.", "price": "15.99 USD"},
{"restaurant_id": "1", "category": "Extra Large Pizza", "name": "Extra Large Supreme", "description": "Whole pie.", "price": "15.99 USD"},
{"restaurant_id": "1", "category": "Extra Large Pizza", "name": "Extra Large Pepperoni", "description": "Whole pie.", "price": "14.99 USD"},

The error I am getting is
Using 18 labels from model: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC,
MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

✘ Error while validating stream: no first example.
This likely means that your stream is empty. This can also mean all the examples
in your stream have been annotated in datasets included in your --exclude recipe
parameter.

This is my very first NER project for spaCy, so forgive me if I am missing something basic. Any help would be wonderful.

ryanwesslen · August 11, 2023, 2:34pm

hi @GitMatt-design,

Thanks for your question and welcome to the Prodigy community!

A couple of things I noticed.

First, by default, Prodigy looks for a "text" key in your input .jsonl as the text that will be annotated. Here's a link on the appropriate input format. You don't have one in your input file:

GitMatt-design:

[
{"restaurant_id": "restaurant_id", "category": "category", "name": "name", "description": "description", "price": "price"},
{"restaurant_id": "1", "category": "Extra Large Pizza", "name": "Extra Large Meat Lovers", "description": "Whole pie.", "price": "15.99 USD"},
{"restaurant_id": "1", "category": "Extra Large Pizza", "name": "Extra Large Supreme", "description": "Whole pie.", "price": "15.99 USD"},
{"restaurant_id": "1", "category": "Extra Large Pizza", "name": "Extra Large Pepperoni", "description": "Whole pie.", "price": "14.99 USD"},

You may also want to use get_stream instead of the JSONL loader.

Also - I noticed your .jsonl contains a list, so it may not be in the right format. A .jsonl file should only have one dictionary per line. You even have an inconsistency in your second row, with "restaurant_id" specified as both a key and a value.
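A quick way to see which lines break that format is to parse each line on its own. The following is only a minimal sketch in plain Python: `check_jsonl` is a hypothetical helper, not part of Prodigy, and the "text" key is the default mentioned above.

```python
import json

def check_jsonl(lines, text_key="text"):
    """Return (line_number, problem) tuples for lines that are not usable tasks."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON on its own"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
        elif text_key not in record:
            problems.append((number, f"missing {text_key!r} key"))
    return problems

# Sample lines: the first two mimic the file in the question
# (a list bracket, then a record with a trailing comma).
sample = [
    '[',
    '{"restaurant_id": "1", "name": "Extra Large Meat Lovers", "price": "15.99 USD"},',
    '{"text": "Extra Large Meat Lovers 15.99 USD"}',
]
for number, problem in check_jsonl(sample):
    print(f"line {number}: {problem}")
```

To check a real file, you could pass an open file handle instead of the `sample` list, since the helper only iterates over lines.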

But what values in your .jsonl do you want annotated? Maybe description? That's what Prodigy is looking to display.

Also - just curious on your custom recipe -- did you try to generate this, say thru ChatGPT? ChatGPT can try to approximate custom recipes but there look like some interesting choices that don't really work for a custom recipe. So even if you fix your input file, I'm not sure if your recipe will even work.
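As one concrete illustration: the recipe in the first post sets each span's "label" to the matched string itself, and it uses text.index, which raises a ValueError when the string is not in the text. Below is a minimal sketch of building character-offset spans in plain Python. `find_spans` is a hypothetical helper, and the label names MENU_ITEM and PRICE are assumptions for illustration only.

```python
def find_spans(text, phrases_by_label):
    """Build span dicts with character offsets for the first match of each phrase."""
    spans = []
    for label, phrases in phrases_by_label.items():
        for phrase in phrases:
            start = text.find(phrase)  # returns -1 instead of raising
            if start == -1:
                continue  # phrase not in this text, so skip it
            spans.append({"start": start, "end": start + len(phrase), "label": label})
    return sorted(spans, key=lambda span: span["start"])

text = "Extra Large Meat Lovers 15.99 USD"
spans = find_spans(text, {"MENU_ITEM": ["Meat Lovers"], "PRICE": ["15.99 USD", "1.99"]})
for span in spans:
    print(span["label"], repr(text[span["start"]:span["end"]]))
```

Here "1.99" is skipped because it does not occur in the text, while the two phrases that do occur get the label name rather than their own text as the label.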

One tip - if you're creating custom recipes, it's nice to learn from the built-in recipes. You can find them in your local Prodigy installation, at the path shown as Location: in the output of the prodigy stats command. Look in the recipes folder there.

Can you describe what task you're trying to accomplish with your recipe?

GitMatt-design · August 11, 2023, 7:07pm

Hi @ryanwesslen

Thank you for your reply. Yes, since this was my first Prodigy project, I have been using ChatGPT, as I felt I understood the documentation well. This is my first time actually writing the program.

The idea for this project is to use spaCy to label Menu_items and Prices, and I want a span for the Groups that contain each of the Menu_items/Prices.

I took this data from a csv file that I just converted to jsonl which is why the data looks like that.

ryanwesslen · August 14, 2023, 2:13pm

Can you try to convert your raw csv to Prodigy's preferred input/source format:

{"text": "This is a description.", "meta": {"restaurant_id": 1}}
{"text": "This is another description.", "meta": {"restaurant_id": 2}}

You can add whatever meta fields you want.
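A minimal sketch of that conversion with Python's standard library, assuming the column names from the sample rows earlier in the thread. `csv_to_tasks` is a hypothetical helper, and joining "name" and "price" into the text is only a guess at what should be annotated.

```python
import csv
import io
import json

def csv_to_tasks(csv_file, text_fields=("name", "price")):
    """Yield one task dict per CSV row; unused columns go into "meta"."""
    # DictReader consumes the header row, so it never becomes a task itself.
    for row in csv.DictReader(csv_file):
        text = " ".join(row[field] for field in text_fields)
        meta = {key: value for key, value in row.items() if key not in text_fields}
        yield {"text": text, "meta": meta}

# In-memory stand-in for the real CSV file.
raw = io.StringIO(
    "restaurant_id,category,name,description,price\n"
    "1,Extra Large Pizza,Extra Large Meat Lovers,Whole pie.,15.99 USD\n"
)
for task in csv_to_tasks(raw):
    print(json.dumps(task))  # one JSON object per line, no brackets or commas
```

Writing each `json.dumps(task)` result on its own line of a file gives a .jsonl in the format above.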

Sorry, I'm not sure what you mean. So you have lots of food items along with metadata on them like its food category, restaurant, and price. What's your goal with spans? It seems like you have relatively little text so I'm not sure what spans you'd expect to annotate.

If you're trying to categorize the food items, it seems like classification (choice) would be more appropriate. But maybe I'm missing something.
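For the classification route, each task carries a list of options next to its text. This is a rough sketch of what such a task could look like, assuming the "options" layout with "id" and "text" keys used by the choice interface. `make_choice_task` is a hypothetical helper, and "Drinks" is a made-up category.

```python
import json

def make_choice_task(text, categories):
    """Build one multiple-choice task with an option per category."""
    options = [{"id": category, "text": category} for category in categories]
    return {"text": text, "options": options}

# "Extra Large Pizza" comes from the sample data; "Drinks" is illustrative.
task = make_choice_task("Extra Large Meat Lovers", ["Extra Large Pizza", "Drinks"])
print(json.dumps(task))
```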

GitMatt-design · August 15, 2023, 1:46am

Hi @ryanwesslen again,

I just saw that Prodigy has a CSV loader that can stream the data directly. My main project is to OCR menus and sort the items by their groups, which would be the spans, i.e. all pop/soda would be in one group. Reading the text as menu items and prices is self-explanatory.

I have a large dataset of over 10K items, but I first wanted to see whether I could load anything into Prodigy at all, since I am still getting used to the tool. This is my first NLP project.

ryanwesslen · August 16, 2023, 3:26pm

Thanks for the background.

If it would help, we've had similar requests for annotating invoices with Prodigy and/or spaCy. I wrote a recent post with examples, including LJ's nice project/blog showing how to annotate with Prodigy with a Huggingface model-in-the-loop:

You can also clone his project repo. If you try this, I would recommend trying first with Prodigy v1.11.14 (i.e., run pip install prodigy==1.11.14 -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy, where XXXX-XXXX-XXXX-XXXX is your license key). The reason is that there could be breaking changes in v1.12 or v1.13.

Although, this is an intermediate-to-advanced Prodigy project as you'll need to read up on some basics about spaCy projects and setting up tesseract for OCR and huggingface for the model-in-the-loop. But it may be helpful to see what can be done with Prodigy, especially if this is one of your first NLP projects. Let me know if you want to try this out and I can coach you on how to try to reproduce LJ's project first before adding in your data.
