Textcat model with multiple classes

Hi! Glad to hear things have been going well so far :smiley:

Your current approach does indeed sound a bit complicated for what it is, and I'm sure there's an easier way to achieve the same result. Have you had a look at the textcat.manual recipe yet? It shows the text and options in a multiple-choice interface and the format you get out is directly compatible with textcat.teach.

A single annotation task in the choice format could look like this:

{
    "text": "Some text",
    "options": [
        {"id": "LABEL1", "text": "Label 1"},
        {"id": "LABEL2", "text": "Label 2"}
    ]
}

When you select an option, a key "accept" is added to the task and it holds a list of the selected IDs. For example: "accept": ["LABEL1", "LABEL2"]. You can also provide those when you load in the data to pre-select certain categories – e.g. based on your rules – and then correct them if needed.
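If it helps, here's a minimal sketch of how you could generate pre-selected tasks from your rules before loading them in. The keyword lists in `RULES` and the helper names `make_task` and `to_jsonl` are made up for illustration, so swap in whatever your rules actually look like. It only uses the standard library and produces one JSON object per line (JSONL).

```python
import json

# Hypothetical label set and keyword rules, purely for illustration.
OPTIONS = [
    {"id": "LABEL1", "text": "Label 1"},
    {"id": "LABEL2", "text": "Label 2"},
]
RULES = {
    "LABEL1": ["refund", "invoice"],
    "LABEL2": ["login", "password"],
}


def make_task(text):
    """Build one choice task, pre-selecting labels whose keywords occur in the text."""
    lowered = text.lower()
    accept = [
        label
        for label, keywords in RULES.items()
        if any(keyword in lowered for keyword in keywords)
    ]
    return {"text": text, "options": OPTIONS, "accept": accept}


def to_jsonl(texts):
    """Serialize the tasks as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(make_task(text)) for text in texts)


print(to_jsonl(["I forgot my password", "Where is my refund?"]))
```

Texts that match no rule simply get an empty "accept" list, so nothing is pre-selected and you pick the labels by hand.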

For training, you might also consider training with spaCy directly – this gives you more flexibility and you get to tweak more settings, experiment with different architectures etc. See here for an example script.
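If you go that route, you'll need to turn the choice annotations into per-label scores first. Here's a rough sketch, assuming your exported examples carry an "answer" key next to "accept" and that your training script expects `(text, {"cats": {...}})` pairs. The helper names are hypothetical, and whether you want to skip examples that weren't accepted depends on your setup.

```python
def to_training_pair(task):
    """Convert one annotated choice task into a (text, {"cats": ...}) pair.

    Every option ID becomes a category: True if it was selected, else False.
    """
    selected = set(task.get("accept", []))
    cats = {option["id"]: option["id"] in selected for option in task["options"]}
    return task["text"], {"cats": cats}


def to_training_data(tasks):
    """Keep only examples that were submitted as accepted (an assumption)."""
    return [to_training_pair(task) for task in tasks if task.get("answer") == "accept"]


# Hypothetical annotated examples, as they might look after export.
options = [
    {"id": "LABEL1", "text": "Label 1"},
    {"id": "LABEL2", "text": "Label 2"},
]
annotated = [
    {"text": "Some text", "options": options, "accept": ["LABEL1"], "answer": "accept"},
    {"text": "Skipped text", "options": options, "accept": [], "answer": "ignore"},
]

print(to_training_data(annotated))
```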

Datasets in Prodigy hold the annotations you collect. There's typically no need to import raw data before you annotate; this can all be done on the command line when you start the recipe.

Datasets are append-only so you'll never lose any state or data. So if you want to manually edit examples in an existing dataset, you should export it, edit it and then import it to a new dataset. This creates more data overall – but it means you'll always be able to recover the previous dataset. We recommend creating a new dataset for every annotation experiment, annotation type etc. Merging datasets later is easy – there's a db-merge command and each example has hashes that let you find all annotations on the same input text. You can also think of a dataset as one unit of data you'd run a particular experiment with.
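To give you an idea of what the hashes let you do, here's a small sketch that groups exported examples by their input hash and combines the selected labels per input. It assumes each example has an `_input_hash` key and an "accept" list; the function names and sample data are made up for illustration.

```python
from collections import defaultdict


def group_by_input(examples):
    """Collect all annotations that were made on the same input text."""
    grouped = defaultdict(list)
    for example in examples:
        grouped[example["_input_hash"]].append(example)
    return dict(grouped)


def merge_labels(examples):
    """Map each input hash to the sorted union of labels selected for it."""
    merged = {}
    for input_hash, group in group_by_input(examples).items():
        labels = set()
        for example in group:
            labels.update(example.get("accept", []))
        merged[input_hash] = sorted(labels)
    return merged


# Hypothetical examples from two annotation experiments on overlapping texts.
exported = [
    {"_input_hash": 1, "text": "Some text", "accept": ["LABEL1"]},
    {"_input_hash": 1, "text": "Some text", "accept": ["LABEL2"]},
    {"_input_hash": 2, "text": "Other text", "accept": []},
]

print(merge_labels(exported))
```

Taking the union is just one way to reconcile annotations. If your experiments can disagree, you may want to flag conflicting inputs for review instead.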