make-gold or manual for textcat?

Hi! Is there a make-gold or manual for textcat? It’s for making a validation set or getting the model kickstarted.

Not at the moment, mostly because this type of logic should be pretty easy to implement using the existing interfaces and a simple custom recipe. See this page for an example of the choice interface. Depending on your label scheme, you can set "choice_style": "multiple" in your config to allow multiple selection, or "choice_auto_accept": true to automatically accept a task as soon as an option is selected.

The only difference from the manual NER recipes is that a solution using the choice interface will produce more “general” data: it won’t set the "label" specifically and will instead list the options the user selected. For example, if your options have the IDs 1, 2 and 3 and the user selects the first two, the task will include "accept": [1, 2]. This should be fairly easy to convert, though. For example:

textcat_examples = []

for example in CHOICE_EXAMPLES:  # the exported examples
    for option_id in example.get('accept', []):  # iterate over accepted ids
        eg = dict(example)  # copy task
        eg['label'] = option_id  # set label
        textcat_examples.append(eg)
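
If you want to try the conversion without a Prodigy dataset at hand, here is a minimal, self-contained sketch of the same loop. The sample tasks and the `choice_to_textcat` helper name are made up for illustration. The check on "answer" assumes the exported tasks carry Prodigy's usual "answer" field and that you only want the accepted ones. Dropping the "options" and "accept" keys from the copies is optional.

```python
def choice_to_textcat(choice_examples):
    """Create one textcat example per option selected in a choice task."""
    textcat_examples = []
    for example in choice_examples:
        # Assumption: skip tasks that were rejected or ignored by the annotator.
        if example.get('answer', 'accept') != 'accept':
            continue
        for option_id in example.get('accept', []):
            # Shallow copy of the task without the choice-specific keys.
            eg = {k: v for k, v in example.items() if k not in ('options', 'accept')}
            eg['label'] = option_id  # the selected option ID becomes the label
            textcat_examples.append(eg)
    return textcat_examples


# Made-up sample data in the shape the choice interface produces.
OPTIONS = [{'id': 'SPORTS', 'text': 'SPORTS'}, {'id': 'POLITICS', 'text': 'POLITICS'}]
CHOICE_EXAMPLES = [
    {'text': 'a', 'options': OPTIONS, 'accept': ['SPORTS', 'POLITICS'], 'answer': 'accept'},
    {'text': 'b', 'options': OPTIONS, 'accept': [], 'answer': 'accept'},
    {'text': 'c', 'options': OPTIONS, 'accept': ['SPORTS'], 'answer': 'ignore'},
]

textcat_examples = choice_to_textcat(CHOICE_EXAMPLES)
for eg in textcat_examples:
    print(eg['text'], eg['label'])
```

A task with two selected options yields two examples, one per label, and a task with nothing selected yields none.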

Thanks! To save a little time for others I’ll add my code:

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('textcat_manual',
    dataset=prodigy.recipe_args['dataset'],
    file_path=("Path to texts", "positional", None, str),
    label=prodigy.recipe_args['label'])
def textcat_manual(dataset, file_path, label):
    """Annotate texts with category labels using the choice interface."""
    stream = JSONL(file_path)            # load in the JSONL file
    stream = add_options(stream, label)  # add options to each task

    return {
        'dataset': dataset,   # save annotations in this dataset
        'view_id': 'choice',  # use the choice interface
        'stream': stream,
    }


def add_options(stream, label):
    """Helper function to add options to every task in a stream."""
    # label is a comma-separated string, e.g. "label1,label2"
    options = [{'id': name, 'text': name} for name in label.split(',')]

    for task in stream:
        task['options'] = options
        yield task
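
If your labels aren't mutually exclusive, the "choice_style" setting mentioned above can go in prodigy.json or, as sketched here, be returned from the recipe under a 'config' key. The snippet only builds the dictionary a recipe would return, in plain Python, so it runs without Prodigy. `make_options`, `build_components` and the sample values are hypothetical names for illustration, and stripping whitespace around the labels is my own addition.

```python
def make_options(label):
    """Build choice options from a comma-separated label string."""
    names = [name.strip() for name in label.split(',')]
    return [{'id': name, 'text': name} for name in names if name]


def add_options(stream, label):
    """Attach the same list of options to every task in the stream."""
    options = make_options(label)
    for task in stream:
        task['options'] = options
        yield task


def build_components(dataset, stream, label, multiple=False):
    """Hypothetical helper: the dict a recipe like textcat_manual would return."""
    components = {
        'dataset': dataset,
        'view_id': 'choice',
        'stream': add_options(stream, label),
    }
    if multiple:
        # Recipe-level setting, assumed equivalent to "choice_style" in prodigy.json.
        components['config'] = {'choice_style': 'multiple'}
    return components


# Made-up inputs standing in for the CLI arguments and the JSONL stream.
components = build_components('my_dataset', [{'text': 'Some text'}],
                              'SPORTS, POLITICS', multiple=True)
first_task = next(components['stream'])
print(first_task)
```

With multiple selection enabled, a single task can end up with several IDs in "accept", which the conversion loop earlier in the thread already handles.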

And you call the recipe like so (run it from the folder that contains textcat_manual.py):

prodigy textcat_manual [db-name] [path to validation jsonl] --label label1,label2,...,labeln -F textcat_manual.py

Then you have to clean up the annotations as Ines showed. The following built-in Prodigy functions are helpful:

from prodigy.components.db import connect

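db = connect()  # connect to the DB using the prodigy.json settings
choice_examples = db.get_dataset('db-name')  # the dataset annotated with the recipe above
# convert choice_examples into textcat_examples here, using Ines's loop
db.add_dataset('new_db_name')  # create the dataset for the converted examples
db.add_examples(textcat_examples, datasets=['new_db_name'])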

Then you can use textcat.batch-train with the flag --eval-id new_db_name to evaluate on the converted dataset.
