Reduce the number of categories in textcat project

Hi there, I'm working on a textcat project with three categories. I've almost reached 2000 annotations in my training set and realized that:

  • it's quite difficult to distinguish between categories 1 and 3

  • Category n°3 is making the dataset very imbalanced (~60 annotations in the whole dataset)

The overall performance from launching train is .60, but if I export the annotations, remove class n°3, and train a random forest model, the accuracy jumps to .87. So, considering point n°1 (and, more generally, the purpose of this classifier), how can I convert this project so that it merges annotations from category 3 into category 1?
Hope I made myself clear, thanks for the help!

how can I convert this project so that it merges annotations from category 3 into category 1?

It depends a bit on how you annotated the data (did you use a binary yes/no interface for each class, or a choice interface with all three classes shown per example?), but this sounds like it's most easily done with a Python script. Here's a starter:

from prodigy.components.db import connect

db = connect()

# Fetch a list of dictionaries that you can for-loop over
dataset = db.get_dataset_examples("<dataset name>")

I would loop over all the examples and edit or remove the annotations that you're not interested in. You could then save this subset into a new dataset in Prodigy via the db-in recipe and continue working from there.
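For instance, here's a minimal sketch of the merge you described, assuming a choice-style dataset where the selected labels are stored under "accept" (the label and dataset names are placeholders). This writes the edited examples straight into a new dataset via the database API, so you could also skip the export/db-in round trip:

from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset_examples("<dataset name>")

merged = []
for eg in examples:
    # Rename "cat-3" to "cat-1", effectively merging the two categories
    labels = ["cat-1" if label == "cat-3" else label for label in eg.get("accept", [])]
    eg["accept"] = list(dict.fromkeys(labels))  # de-duplicate if both were selected
    merged.append(set_hashes(eg, overwrite=True))  # refresh hashes after editing

db.add_dataset("merged_2cat")  # placeholder name for the new dataset
db.add_examples(merged, datasets=["merged_2cat"])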

Would this work?

train a random forest model, the accuracy jumps to .87

Out of curiosity, is there any particular reason why you've opted for a random forest? Are you using count vectors with it in sklearn? If so, it might be worth trying a simple logistic regression as well; in my experience that performs very well.
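If you want a quick baseline to compare against, here's a minimal sketch with sklearn; it assumes texts and labels are plain Python lists built from your exported annotations (both names are placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# texts: list of strings, labels: list of category names, both assumed
# to come from your exported annotations
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=5)
print(f"Mean accuracy: {scores.mean():.2f}")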


As usual, thanks a lot for your support. I actually used three ways to annotate the data: OpenAI suggestions, textcat.teach, and textcat.correct.

I tried your method and used this code to create a new JSON file with only two categories:

old_list = dataset  # the dataset fetched from the database
new_list = []

for d in old_list:
    if 'accept' in d and ('cat-1' in d['accept'] or 'cat-2' in d['accept']):
        new_dict = {'text': d['text'], 'accept': d['accept']}
        new_list.append(new_dict)
# This way I obtained a list of 1378 dictionaries; cat-3 was just 58 entries, so I guess all the others were "skip"

#then to export the new list of dictionaries
import json

# Export the new_list as a JSON file
with open('dataset_2cat.json', 'w') as f:
    json.dump(new_list, f)

Then I used this to import into Prodigy:

prodigy db-in ineqi_2cat dataset_2cat.json
✔ Created dataset 'ineqi_2cat' in database SQLite
✔ Imported 1378 annotations to 'ineqi_2cat' (session
2023-05-01_18-02-20) in database SQLite

So far so good, it seems. But when I launch prodigy train to see the results with only the two categories, I get this result:

prodigy train --textcat ineqi_2cat ./model2cat
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config

=========================== Initializing pipeline ===========================
[2023-05-01 18:03:38,746] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components
  - [textcat] Training: 1103 | Evaluation: 275 (20% split)
Training: 0 | Evaluation: 0
Labels: textcat (0)
[2023-05-01 18:03:38,785] [INFO] Pipeline: ['textcat']
[2023-05-01 18:03:38,786] [INFO] Created vocabulary
[2023-05-01 18:03:38,787] [INFO] Finished initializing nlp object
Traceback (most recent call last):
  File "/Users/aleportatile/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/aleportatile/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/prodigy/recipes/train.py", line 289, in train
    return _train(
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/prodigy/recipes/train.py", line 201, in _train
    nlp = spacy_init_nlp(config, use_gpu=gpu_id)
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/spacy/training/initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/spacy/language.py", line 1308, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/spacy/pipeline/textcat.py", line 373, in initialize
    validate_get_examples(get_examples, "TextCategorizer.initialize")
  File "spacy/training/example.pyx", line 64, in spacy.training.example.validate_get_examples
TypeError: [E930] Received invalid get_examples callback in `TextCategorizer.initialize`. Expected function that returns an iterable of Example objects but got: []

What could it be?

Out of curiosity, is there any particular reason why you've opted for a random forest? Are you using count vectors with it in sklearn? If so, it might be worth trying a simple logistic regression as well; in my experience that performs very well.

It was mainly curiosity, as I already have an R script with a random forest implementation over several stylometric features (if you're interested, check out Mikros & Perifanos, 2013). I'll check out logistic regression.
EDIT: Interestingly, the logistic regression performance over these 1378 annotations is .76, lower than the RF with the AMNP procedure.

Your error message suggests that you've got an empty training set, and I think I see why. The docs for the db-in command state that you must provide the data in newline-delimited JSON (.jsonl) format, not .json. In a .jsonl file, every line must contain exactly one dictionary.
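For illustration, a two-category .jsonl file would look something like this (the texts are made up):

{"text": "first example", "accept": ["cat-1"], "answer": "accept"}
{"text": "second example", "accept": ["cat-2"], "answer": "accept"}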

I think if you replace this:

#then to export the new list of dictionaries
import json

# Export the new_list as a JSON file
with open('dataset_2cat.json', 'w') as f:
    json.dump(new_list, f)

with this:

import srsly

srsly.write_jsonl("dataset_2cat.jsonl", new_list)

it should all work again.

This code correctly exports the annotations as a .jsonl file. However, when I import it using the db-in recipe, I get this output:

✔ Created dataset 'dataset_2cats' in database SQLite
✔ Imported 1378 annotations to 'dataset_2cats' (session
2023-05-03_12-13-45) in database SQLite
Found and keeping existing "answer" in 0 examples

And if I run the train command:

prodigy train --textcat ineqi_2cats
TypeError: [E930] Received invalid get_examples callback in `TextCategorizer.initialize`. Expected function that returns an iterable of Example objects but got: []

Could you share the command that you ran? Did you also add the --answer accept argument to db-in? If that doesn't work, could you share two examples of your data so that I might try to reproduce locally?
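One possibility: since your export script only kept "text" and "accept", the imported examples have no "answer" key (the "Found and keeping existing "answer" in 0 examples" line hints at this), and --answer accept would set it explicitly on import. For example, re-importing into a fresh dataset (the name here is just a placeholder):

prodigy db-in ineqi_2cats_v2 dataset_2cat.jsonl --answer accept
prodigy train --textcat ineqi_2cats_v2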