Hi! Is there a make-gold or manual recipe for textcat? I'd like to use it to create a validation set, or to kickstart a model.
Not at the moment – mostly because this type of logic should be pretty easy to implement using the existing interfaces and a simple custom recipe. See this page for an example of the choice interface. Depending on your label scheme, you can set "choice_style": "multiple" in your config to allow multiple selections, or "choice_auto_accept": true to automatically accept a task as soon as an option is selected.
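For reference, those settings go into the "config" entry of a custom recipe's return value – a minimal sketch (the surrounding recipe is assumed, and the commented-out line shows the single-select alternative):

return {
    'dataset': dataset,     # save annotations in this dataset
    'view_id': 'choice',    # use the choice interface
    'stream': stream,
    'config': {
        'choice_style': 'multiple'    # allow selecting more than one option
        # or, for single-select workflows:
        # 'choice_auto_accept': True  # accept the task once an option is picked
    }
}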
The only difference to the manual NER recipes is that a solution using the choice interface will produce more "general" data – i.e. it won't set the "label" specifically and will instead list the options the user selected. For example, if your options have the IDs 1, 2 and 3 and the user selects the first two, the task will include "accept": [1, 2]. This should be fairly easy to convert, though. For example:
textcat_examples = []
for example in CHOICE_EXAMPLES:                  # the exported examples
    for option_id in example.get('accept', []): # iterate over the accepted IDs
        eg = dict(example)                      # copy the task
        eg['label'] = option_id                 # set the label to the option ID
        textcat_examples.append(eg)
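To make the conversion concrete, here's what the loop above does to a hypothetical exported task (text and labels are made up):

# a hypothetical exported choice task with two accepted options
example = {'text': 'Great movie!', 'accept': ['POSITIVE', 'FUNNY']}

# the loop produces two textcat examples, one per accepted option:
# {'text': 'Great movie!', 'accept': ['POSITIVE', 'FUNNY'], 'label': 'POSITIVE'}
# {'text': 'Great movie!', 'accept': ['POSITIVE', 'FUNNY'], 'label': 'FUNNY'}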
Thanks! To save a little time for others, I'll add my code:
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe('textcat_manual',
                dataset=prodigy.recipe_args['dataset'],
                file_path=("Path to texts", "positional", None, str),
                label=prodigy.recipe_args['label'])
def textcat_manual(dataset, file_path, label):
    """Manually annotate texts with one or more category labels."""
    stream = JSONL(file_path)            # load in the JSONL file
    stream = add_options(stream, label)  # add options to each task
    return {
        'dataset': dataset,   # save annotations in this dataset
        'view_id': 'choice',  # use the choice interface
        'stream': stream,
    }
def add_options(stream, label):
    """Helper function to add options to every task in a stream."""
    options = [{'id': l, 'text': l} for l in label.split(',')]
    for task in stream:
        task['options'] = options
        yield task
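With --label POSITIVE,NEGATIVE (hypothetical labels), each task in the stream ends up looking like this:

{"text": "Great movie!", "options": [{"id": "POSITIVE", "text": "POSITIVE"}, {"id": "NEGATIVE", "text": "NEGATIVE"}]}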
And you call the recipe like so (working from the same folder where textcat_manual.py is saved):

prodigy textcat_manual [db-name] [path to validation jsonl] --label label1,label2,...,labeln -F textcat_manual.py
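For instance, with a made-up dataset name, file and label set:

prodigy textcat_manual sentiment_eval ./validation.jsonl --label POSITIVE,NEGATIVE,NEUTRAL -F textcat_manual.py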
Then you have to clean up the annotations as Ines showed. The following built-in Prodigy functions are helpful:
from prodigy.components.db import connect

db = connect()  # connect to the DB using the prodigy.json settings
choice_examples = db.get_dataset('db-name')  # the dataset you annotated

# Ines' conversion code from above goes here: iterate over choice_examples
# and build the textcat_examples list

db.add_dataset('new_db_name')  # create the new dataset
db.add_examples(textcat_examples, datasets=['new_db_name'])
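If you want to inspect the converted data, Prodigy's db-out command can export the dataset to a JSONL file (the output directory here is just an example):

prodigy db-out new_db_name ./output  # writes new_db_name.jsonl into ./output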
Then you can use textcat.batch-train with the flag --eval-id new_db_name.
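For example, assuming a training dataset called training_db and en_core_web_sm as the base model:

prodigy textcat.batch-train training_db en_core_web_sm --eval-id new_db_name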