Happy to hear it!
The errors in this thread were solved, but I have a new one that is preventing us from upgrading to v1.14.6.
We are getting an error message stating that it cannot find any of our custom recipes:
✘ Can't find recipe or command 'ner_task_v3_b_validation'.
Run prodigy --help for more details. If you're using a custom recipe, provide
the path to the Python file using the -F argument.
Available recipes: ab.llm.tournament, ab.openai.prompts, ab.openai.tournament,
abandonment, account-matching, agent-speedbump, audio.manual, audio.transcribe,
compare, coref.manual, custom-intent-calibration, custom-labels,
custom-labels-prompt-multi-intent, data-to-spacy, db-in, db-merge, db-out,
dep.correct, dep.teach, drop, eup-corpus-validation,
eup-corpus-validation-with-options, filter-by-patterns, helpful-banking-moments,
helpful-banking-moments-repeat-intents, image.manual, login-failure, mark,
match, metric.iaa.binary, metric.iaa.doc, metric.iaa.span, ner-spans,
ner.correct, ner.eval-ab, ner.llm.correct, ner.llm.fetch, ner.manual,
ner.model-annotate, ner.openai.correct, ner.openai.fetch, ner.silver-to-gold,
ner.teach, pos.correct, pos.teach, print-dataset, print-stream, progress,
rel.manual, review, sent.correct, sent.teach, spacy-config, spans.correct,
spans.llm.correct, spans.llm.fetch, spans.manual, spans.model-annotate, stats,
stt-error-validation, stt-spans, terms.llm.fetch, terms.openai.fetch,
terms.teach, terms.to-patterns, textcat.correct, textcat.llm.correct,
textcat.llm.fetch, textcat.manual, textcat.model-annotate,
The error only appears when we run the Prodigy server programmatically, i.e., using the prodigy.serve() method as explained here (https://prodi.gy/docs/api-components#serve). It worked just fine until we upgraded to v1.14.6.
And this error appears for all our custom recipes, not just the one in the error message above.
We tried adding the recipe path to the serve() command (with -F, as I do when I invoke prodigy directly), but that did not solve the error.
Were there any changes in this update that would have caused this?
The only changes in v1.14.6 should be the Pydantic/spaCy version bumps. But I can't imagine that would cause this issue.
Looking at your output though ... I can't help but notice stt-spans and stt-error-validation in the list of known recipes. So it seems it is able to detect some of your custom recipes.
I also just downloaded Prodigy v1.14.6 to try and reproduce your error. I have this custom recipe:
import prodigy

@prodigy.recipe(
    "my-custom-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    view_id=("Annotation interface", "option", "v", str)
)
def my_custom_recipe(dataset, view_id="text"):
    # Load your own streams from anywhere you want
    stream = [{"text": f"omg {i}"} for i in range(1000)]

    def update(examples):
        # This function is triggered when Prodigy receives annotations
        print(f"Received {len(examples)} annotations!")

    return {
        "dataset": dataset,
        "view_id": view_id,
        "stream": stream,
        "update": update
    }
I'm able to confirm that this runs fine:
python -m prodigy my-custom-recipe xxx --view-id text -F recipe.py
When I move it into a Python script, like so:
import prodigy
prodigy.serve("prodigy my-custom-recipe xxx --view-id text -F recipe.py", port=9000)
Then I seem to hit the same issue.
✘ Can't find recipe or command 'my-custom-recipe'.
Run prodigy --help for more details. If you're using a custom recipe, provide
the path to the Python file using the -F argument.
Available recipes: ab.llm.tournament, ab.openai.prompts, ab.openai.tournament,
audio.manual, audio.transcribe, compare, coref.manual, data-to-spacy, db-in,
db-merge, db-out, dep.correct, dep.teach, drop, filter-by-patterns,
image.manual, mark, match, metric.iaa.binary, metric.iaa.doc, metric.iaa.span,
ner.correct, ner.eval-ab, ner.llm.correct, ner.llm.fetch, ner.manual,
ner.model-annotate, ner.openai.correct, ner.openai.fetch, ner.silver-to-gold,
ner.teach, pos.correct, pos.teach, print-dataset, print-stream, progress,
rel.manual, review, sent.correct, sent.teach, spacy-config, spans.correct,
spans.llm.correct, spans.llm.fetch, spans.manual, spans.model-annotate, stats,
terms.llm.fetch, terms.openai.fetch, terms.teach, terms.to-patterns,
textcat.correct, textcat.llm.correct, textcat.llm.fetch, textcat.manual,
textcat.model-annotate, textcat.openai.correct, textcat.openai.fetch,
textcat.teach, train, train-curve
When I revert to v1.14.1, however, I seem to get the same error.
> python -m pip install prodigy==1.14.1 -f https://<license-key>@download.prodi.gy
> python serve.py
Just to check, what version of Prodigy does work for you here? I'm definitely eager to dive into this, but it would help to know when this feature might've broken.
Ah! I think I've spotted the issue. Not 100% sure, but this might be it.
This was my serve.py file originally.
import prodigy
prodigy.serve("prodigy my-custom-recipe xxx --view-id text -F recipe.py", port=9000)
The reason why you pass -F recipe.py locally is that this file needs to run in order for the prodigy.recipe decorator to register the recipe. But we can also achieve that by simply importing it within the Python script.
import prodigy
import recipe
prodigy.serve("prodigy my-custom-recipe xxx --view-id text", port=9000)
When I run import recipe, the entire script runs as a side effect, which also registers the recipe. From there, prodigy.serve is able to run it.
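That side effect is the whole trick: a registering decorator runs when the module's top-level code runs. A toy sketch of the mechanism, assuming a dict-based registry (RECIPES and recipe below are illustrative names, not Prodigy's actual internals):

```python
# Toy stand-in for a recipe registry: the decorator writes into a dict at the
# moment the defining module is executed.
RECIPES = {}

def recipe(name):
    def register(func):
        RECIPES[name] = func  # registration happens when the module runs
        return func
    return register

# In the real setup this definition lives in recipe.py; running that file
# (directly, via -F, or via `import recipe`) executes the decorator and
# fills the registry.
@recipe("my-custom-recipe")
def my_custom_recipe(dataset):
    return {"dataset": dataset}

print("my-custom-recipe" in RECIPES)  # True
```

So a recipe is only findable by name after the file defining it has actually been executed in the serving process.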
Might this explain what is happening on your end?
Here is our server.py, which seems similar to your serve.py. I tried adding from prodigy import recipes, and from prodigy.core import recipe, but this didn't resolve the error. I'm not sure why two recipes are "working" with the latest update and the rest are not -- but here's our code in case it helps troubleshoot.
import logging
import multiprocessing as mp
import os
import time
from datetime import datetime
from threading import Lock
from typing import Dict, List

import prodigy
from google.cloud import storage

# The import is used implicitly by the "run_server" method below
from annotator import GCS_FILES_FOLDER, GCS_ROOT, recipes  # noqa
from annotator.utils import timestamp

logger = logging.getLogger("hypercorn.access")
gcs_client = storage.Client(project="xxxx")


class TaskDefinition(dict):
    """
    An annotation task consists of a named recipe and a dataset
    """
    _required_keys = ("recipe", "dataset")
    _optional_keys = (
        "filepath", "input_sets", "spacy_model", "labels", "label_field",
        "choice_field", "view_id"
    )

    def __init__(self, **kwargs):
        """
        Make sure that all the required key/values are present and that all the
        keys are known (either required or optional)
        """
        for key in self._required_keys:
            assert key in kwargs
        for key in list(kwargs):
            assert key in self._required_keys + self._optional_keys, \
                f"{key} is not a recognized attribute"
        self.update({k: v for k, v in kwargs.items()})

    def convert_to_args(self) -> str:
        """
        Convert the task's attributes to a string of arguments that can be
        passed to the prodigy serve command

        :return: a string with the command-line arguments
        """
        # The filepath attribute is the GCS location, which consists of two
        # parts: the first two characters identify the subfolder, the remaining
        # 30 are the filename. We discard the subfolder and prepend "data" to
        # the remainder to obtain the target path (i.e., where the GCS download
        # will store the file), then use it (rather than the original filepath)
        # as the source specification (third argument) for the prodigy command
        command_args = [self["recipe"], self["dataset"]]
        if "filepath" in self:
            bucket = gcs_client.bucket(GCS_ROOT)
            source_path = os.path.join(
                GCS_FILES_FOLDER, self["filepath"][:2], self['filepath'][2:]
            )
            target_path = f"data/{self['filepath'][2:]}.jsonl"
            blob = bucket.blob(source_path)
            blob.download_to_filename(target_path)
            logger.info(f"DOWNLOADED [GCS] '{source_path}' to '{target_path}'")
            command_args.append(target_path)
        if "spacy_model" in self:
            command_args.insert(2, self["spacy_model"])
        if "input_sets" in self:
            command_args.append(f"{','.join(self['input_sets'])}")
        if "labels" in self:
            command_args.append(f"--label {','.join(self['labels'])}")
        if "label_field" in self:
            command_args.append(f"-l {self['label_field']}")
        if "choice_field" in self:
            command_args.append(f"-c {self['choice_field']}")
        if "view_id" in self:
            command_args.append(f"--view-id {self['view_id']}")
        logger.info(f"CREATING task with '{command_args}'")
        return " ".join(command_args)


class ProdigyServer:
    """
    Stores information about the Prodigy web server (such as the port number)
    and manages the start and termination of the actual subprocess.
    """
    @classmethod
    def set_url_prefix(cls, prefix: str) -> None:
        """
        Set the URL prefix to be used by all servers

        :param prefix: a string, obtained from the config file, and dependent
        on the environment where the app is running
        """
        cls.prefix = prefix

    def __init__(self, port_num: int):
        """
        Create a new server with the given port number
        """
        self.port_num = port_num
        self._proc = None
        self.start_time = None

    @property
    def url(self):
        return self.prefix + str(self.port_num)

    def is_available(self):
        return self._proc is None or not self._proc.is_alive()

    def is_running(self):
        return self._proc.is_alive()

    def start(
        self, taskdef: TaskDefinition, start_time: datetime = None,
        wait_time: int = 10
    ):
        """
        Start the server with the attributes from the task definition and give
        it a bit of time to settle down

        :param taskdef: the TaskDefinition (with the recipe name, filepath, and
        other attributes required for the server command)
        :param start_time: date and time the task first started; will not be
        None if the task was recreated from the active_tasks table at startup
        :param wait_time: the number of seconds to wait to give the server a
        chance to initialize properly (default: 10)
        """
        # Start the Prodigy webserver and give it 10 seconds before returning
        # (the .is_alive() method returns True immediately, so we cannot wait
        # for that)
        command_args = taskdef.convert_to_args()
        self._proc = mp.Process(
            target=run_server, args=(command_args, self.port_num,)
        )
        self._proc.daemon = False
        self._proc.start()
        time.sleep(wait_time)
        self.start_time = start_time or timestamp()

    def terminate(self):
        """
        If the server process is currently running, terminate it
        """
        if self._proc.is_alive():
            self._proc.terminate()
            while self._proc.is_alive():
                time.sleep(.1)
        self._proc = None
        self.start_time = None


class AnnotationTask:
    """
    Stores the task definition and manages the multiprocessing.Process for an
    annotation task. Also keeps a list of annotators working on the task.
    """
    # A finite list of available port numbers; the Helm chart assigns explicit
    # addresses to each of them, so they cannot be random
    _reserved_ports: List[int] = [port_num for port_num in range(9091, 9101)]
    # For use as a context manager so only one thread can obtain/return a port
    # number at any one time
    _lock = Lock()

    @classmethod
    def _claim_port(cls, port_num: int = None) -> int:
        """
        Claim a port by number, or get the next available one; raises an error
        if we've run out of port numbers

        :param port_num: the port number to assign to the server; if not given,
        select the next available one from the list of reserved port numbers
        :return: an integer value between 9091 and 9100 (inclusive)
        :raises: ValueError if no port numbers are available
        """
        with cls._lock:
            if cls._reserved_ports:
                if port_num is not None:
                    cls._reserved_ports.remove(port_num)
                else:
                    port_num = cls._reserved_ports.pop(0)
                return port_num
            raise ValueError("All reserved port numbers are taken")

    @classmethod
    def available_ports(cls) -> List[int]:
        return cls._reserved_ports

    def __init__(self, taskdef: TaskDefinition, port_num: int = None):
        """
        Start a new Prodigy server with the task described in the definition

        :param taskdef: an object with all the attributes needed to start the
        process
        :param port_num: the port number for the task; if not specified, select
        the next available one
        """
        port_num = AnnotationTask._claim_port(port_num=port_num)
        self._task_def = taskdef
        self._annotators = set()
        self._server = ProdigyServer(port_num)

    def start(self, start_time: datetime = None, wait_time: int = 10):
        """
        :param start_time: date and time the task first started; will not be
        None if the task was recreated from the active_tasks table at startup
        :param wait_time: the number of seconds to wait to give the server a
        chance to initialize properly (default: 10)
        """
        self._server.start(
            self._task_def, start_time=start_time, wait_time=wait_time
        )

    @property
    def url(self) -> str:
        """
        Return the externally accessible address for the server handling the
        current task

        :return: a URL (string)
        """
        return self._server.url

    def is_running(self):
        return self._server.is_running()

    def add_annotator(self, annotator_name: str) -> None:
        self._annotators.add(annotator_name)

    def terminate(self):
        """
        Terminate the Prodigy server process; as a side effect, the port number
        assigned to the server is now available for another task
        """
        with self._lock:
            self._reserved_ports.append(self._server.port_num)
        self._server.terminate()

    def summary(self) -> Dict:
        """
        Return useful information about the task
        """
        return {
            "task": self._task_def,
            "url": self._server.url,
            "annotators": list(self._annotators),
            "started_at": self._server.start_time
        }


def run_server(command_args: str, port: int):
    prodigy.serve(command_args, port=port, host="0.0.0.0")
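For reference, here is a trimmed-down, standalone sketch of the command string that a convert_to_args-style helper produces for a simple task (the GCS download is left out, and the function below is an illustration of the pattern in our code, not the production method; the recipe and dataset names are just examples):

```python
# Simplified stand-in for TaskDefinition.convert_to_args: turn a task dict
# into the argument string handed to prodigy.serve.
def convert_to_args(task: dict) -> str:
    command_args = [task["recipe"], task["dataset"]]
    if "spacy_model" in task:
        # The spaCy model is inserted as the third positional argument
        command_args.insert(2, task["spacy_model"])
    if "labels" in task:
        command_args.append(f"--label {','.join(task['labels'])}")
    if "view_id" in task:
        command_args.append(f"--view-id {task['view_id']}")
    return " ".join(command_args)

args = convert_to_args({
    "recipe": "ner_task_v3_b_validation",
    "dataset": "my_dataset",
    "labels": ["PERSON", "ORG"],
})
print(args)  # ner_task_v3_b_validation my_dataset --label PERSON,ORG
```

Note that the resulting string starts with the recipe name, so prodigy.serve must already know that name from a registered recipe before the call is made.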
Ah! The reason that I import recipe is because this file is called recipe.py.
import prodigy

@prodigy.recipe(
    "my-custom-recipe",
    dataset=("Dataset to save answers to", "positional", None, str),
    view_id=("Annotation interface", "option", "v", str)
)
def my_custom_recipe(dataset, view_id="text"):
    # Load your own streams from anywhere you want
    stream = [{"text": f"omg {i}"} for i in range(1000)]

    def update(examples):
        # This function is triggered when Prodigy receives annotations
        print(f"Received {len(examples)} annotations!")

    return {
        "dataset": dataset,
        "view_id": view_id,
        "stream": stream,
        "update": update
    }
If the file with your custom recipes is called dinosaurhead.py, then you should be able to have this import statement in your server.py file:
import dinosaurhead
If you import using the file name, does that help?
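If a plain import by module name is awkward (for instance, because the recipe file lives outside the package), the same side effect can be triggered by executing the file by path with the standard library. A sketch, using a throwaway stand-in module written to disk instead of a real recipe file:

```python
import importlib.util
import os
import tempfile

# Create a tiny stand-in for a recipe file (illustration only); in practice
# this would be your real file with @prodigy.recipe decorators at top level.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("REGISTERED = ['my-custom-recipe']\n")
    path = f.name

# Load the file by path; exec_module runs its top-level code, which is the
# same side effect that lets @prodigy.recipe register the recipe by name.
spec = importlib.util.spec_from_file_location("custom_recipes", path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
os.unlink(path)

print(module.REGISTERED)  # ['my-custom-recipe']
```

With a real recipe file loaded this way, a subsequent prodigy.serve call should then be able to find the recipe by name, just as it would after a regular import.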
We tried this latest approach, and it didn't work for us. So we're still on v1.13.3, because it is stable for us, even though it's quite far behind the current version.
We re-installed v1.13.3 (which is what's currently working on staging) and commented out the , recipes part from this line in server.py:
from annotator import GCS_FILES_FOLDER, GCS_ROOT, recipes
to
from annotator import GCS_FILES_FOLDER, GCS_ROOT
That produces the error ✘ Can't find recipe 'account-matching' and no list of available recipes. This seems to confirm that it is the global import of recipes that loads the available recipes. And in the error message that initially prompted our investigation, all our custom recipes are listed under Available recipes:. So the question remains: why did this approach work prior to v1.14.*? To which we now add another: why can v1.14.6 not load a recipe that is listed as available?