Bulk import textcat examples

Hey everyone,

I'm currently working on a dataset of bank transaction data. I use a pre-calculation step that clusters examples into JSONL files. I review these files and group the examples into JSONL files that each apply to a certain label or label set. So I end up with JSONL files that I want to import into Prodigy with a given set of labels.

I've written my own custom recipe for this:

import json
from collections.abc import Iterable
from typing import Union

from prodigy.components.db import connect
from prodigy.components.stream import get_stream
from prodigy.core import Arg, recipe


@recipe(
    "bulk-import",
    # fmt: off
    dataset=Arg(help="Dataset to save annotations to"),
    source=Arg(help="Data to import (file path or '-' to read from standard input)"),
    label=Arg(
        "--label", "-l", help="Comma-separated label(s) to assign to all examples"
    ),
    loader=Arg(
        "--loader", "-lo", help="Loader (guessed from file extension if not set)"
    ),
    # fmt: on
)
def bulk_import(
    dataset: str,
    source: Union[str, Iterable[dict]],  # noqa
    label: Union[str, None] = None,  # noqa
    loader: Union[str, None] = None,  # noqa
) -> None:
    """
    Bulk import JSONL data into a dataset with fixed labels.
    """
    print("RECIPE: Starting recipe bulk-import", locals())
    # Parse the comma-separated label string into a list of label names
    labels = [l.strip() for l in label.split(",")] if label else []

    # Validate the source as a JSONL file before loading it
    if isinstance(source, str) and source != "-":
        try:
            with open(source, encoding="utf-8") as f:
                for i, line in enumerate(f, start=1):
                    if not line.strip():
                        continue  # skip blank lines
                    try:
                        json.loads(line)
                    except json.JSONDecodeError as e:
                        print(f"RECIPE: Invalid JSON on line {i} in file '{source}': {e}")
                        raise ValueError(
                            f"Invalid JSON on line {i} in file '{source}': {e}"
                        ) from e
        except FileNotFoundError as e:
            raise ValueError(f"Source file '{source}' not found.") from e
        except ValueError:
            # Re-raise the invalid-JSON error as-is instead of wrapping it again
            raise
        except Exception as e:
            raise ValueError(f"Error reading source file '{source}': {e}") from e

    stream = get_stream(source, loader=loader, rehash=True, input_key="text")
    stream = list(stream)
    if not stream:
        raise ValueError("No examples loaded from the source. Make sure each JSONL line has a 'text' field.")

    print(f"Loaded {len(stream)} examples")

    # Add the fixed labels to each example and mark it as accepted
    def add_labels_to_stream(stream, labels):
        for example in stream:
            example["accept"] = labels
            example["answer"] = "accept"
            yield example

    if labels:
        print(f"Adding labels {labels} to examples")
        stream = add_labels_to_stream(stream, labels)

    # Create the dataset if it doesn't exist yet and save the examples
    db = connect()
    if dataset not in db.datasets:
        db.add_dataset(dataset)
    db.add_examples(stream, [dataset])

    print(f"RECIPE: Successfully imported data into dataset '{dataset}'")

Since my model performed worse after my last imports, I now want to make sure that I'm not missing anything in the recipe.

Is the above correct? Do I need to keep anything in mind when adding new categories through this process? Do they need to be added to the options somehow?

Thanks for your help!

Just based on guessing, I now remove all examples where .answer is not accept and set the .options key on all examples to always include all categories. That improved the model considerably!
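
Roughly, this post-processing on the exported JSONL looks like this (just a sketch; ALL_LABELS and the file names are placeholders for my real values):

import json

ALL_LABELS = ["STANDING_ORDER", "SALARY", "GROCERIES"]  # placeholder label set

with open("annotations.jsonl", encoding="utf-8") as f_in, \
        open("annotations_fixed.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        example = json.loads(line)
        if example.get("answer") != "accept":
            continue  # drop everything that isn't an accepted example
        # Always attach the full set of categories as options
        example["options"] = [{"id": lbl, "text": lbl} for lbl in ALL_LABELS]
        f_out.write(json.dumps(example) + "\n")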

But the scores on the newly added categories (added via the above bulk import) are still not very good, or barely register at all.

There must be something I'm missing?

Hi @toadle,

I understand you'll be training a multilabel text classifier from this data? If so, then you should indeed provide "options" on the bulk examples (in addition to the accept list). Concretely, the options should follow this format:

  "options": [
    {"id": "BANANA", "text": "🍌 banana"},
    {"id": "BROCCOLI", "text": "🥦 broccoli"},
    {"id": "TOMATO", "text": "🍅 tomato"}
  ]
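
For example, in your recipe you could extend add_labels_to_stream to also attach the options (a minimal sketch; all_labels is assumed to be the full list of categories you use across the project):

def add_labels_to_stream(stream, labels, all_labels):
    # all_labels: assumed full set of categories, not just the ones being assigned here
    options = [{"id": lbl, "text": lbl} for lbl in all_labels]
    for example in stream:
        example["options"] = options
        example["accept"] = labels
        example["answer"] = "accept"
        yield example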

The logic that translates Prodigy examples into textcat training examples iterates through the options and checks whether each id is on the accept list. If it is, the id is considered a label.
If an entry on the accept list does not appear in the options, it is simply ignored.
If there are no options at all, the example is treated as a binary classification task.

Also, there's no need to filter out the reject answers. The train script handles them: if a multilabel textcat answer is rejected, the selected labels are treated as False, which is also useful information for the model.
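
As a simplified illustration of these two points (not the actual implementation), the translation behaves roughly like this:

def example_to_labels(example):
    # Simplified sketch of the behaviour described above, not Prodigy's actual code
    accepted = set(example.get("accept", []))
    answer_value = example.get("answer") == "accept"  # rejected answers -> False
    labels = {}
    for option in example.get("options", []):
        if option["id"] in accepted:
            # ids that appear both in options and on the accept list become labels
            labels[option["id"]] = answer_value
    # accept entries that never appear in the options are simply never added
    return labels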

Finally, if you would ever like to edit the bulk examples in a manual Prodigy UI, you'd need to add these labels to the options so that they are editable.

Apart from the formatting issues, the performance difference may also be related to the fact that bulk annotation might introduce inconsistencies with respect to previously annotated data if there are duplicates or near duplicates. But that, of course, might not be the case at all.