As usual, thanks a lot for your support. I actually used three ways to annotate the data: OpenAI suggestions, textcat.teach and textcat.correct.
I tried your method and used this code to create a new JSON file with only two categories:
old_list = dataset  # the dataset exported from Prodigy
new_list = []
for d in old_list:
    # keep only the examples annotated as cat-1 or cat-2
    if 'accept' in d and ('cat-1' in d['accept'] or 'cat-2' in d['accept']):
        new_dict = {'text': d['text'], 'accept': d['accept']}
        new_list.append(new_dict)
# This way I obtained a list of 1378 dictionaries; cat-3 had just 58 entries, so I guess all the others were "skip".
# Then, to export the new list of dictionaries:
import json
# Export the new_list as a JSON file
with open('dataset_2cat.json', 'w') as f:
json.dump(new_list, f)
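In case it's relevant, a quick sanity check on the exported file could look something like this (just a sketch; the label tally simply mirrors the filtering above):

import json
from collections import Counter

# Reload the exported file and count how often each accepted label appears,
# assuming every entry has 'text' and an 'accept' list as created above
with open('dataset_2cat.json') as f:
    examples = json.load(f)

label_counts = Counter(label for ex in examples for label in ex['accept'])
print(len(examples), label_counts)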
Then I used this command to import it into Prodigy:
prodigy db-in ineqi_2cat dataset_2cat.json
✔ Created dataset 'ineqi_2cat' in database SQLite
✔ Imported 1378 annotations to 'ineqi_2cat' (session
2023-05-01_18-02-20) in database SQLite
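I think the imported tasks could also be dumped back out for inspection with something like the following (a guess at the usual db-out usage, writing JSONL to a file):

prodigy db-out ineqi_2cat > ineqi_2cat_check.jsonl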
So far so good, it seems. But when I launch prodigy train to see the results with only the two categories, I get this result:
prodigy train --textcat ineqi_2cat ./model2cat
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
✔ Generated training config
=========================== Initializing pipeline ===========================
[2023-05-01 18:03:38,746] [INFO] Set up nlp object from config
Components: textcat
Merging training and evaluation data for 1 components
- [textcat] Training: 1103 | Evaluation: 275 (20% split)
Training: 0 | Evaluation: 0
Labels: textcat (0)
[2023-05-01 18:03:38,785] [INFO] Pipeline: ['textcat']
[2023-05-01 18:03:38,786] [INFO] Created vocabulary
[2023-05-01 18:03:38,787] [INFO] Finished initializing nlp object
Traceback (most recent call last):
File "/Users/aleportatile/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/aleportatile/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
controller = recipe(*args, use_plac=True)
File "cython_src/prodigy/core.pyx", line 374, in prodigy.core.recipe.recipe_decorator.recipe_proxy
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/plac_core.py", line 367, in call
cmd, result = parser.consume(arglist)
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/plac_core.py", line 232, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/prodigy/recipes/train.py", line 289, in train
return _train(
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/prodigy/recipes/train.py", line 201, in _train
nlp = spacy_init_nlp(config, use_gpu=gpu_id)
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/spacy/training/initialize.py", line 84, in init_nlp
nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/spacy/language.py", line 1308, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "/Users/aleportatile/miniconda3/lib/python3.10/site-packages/spacy/pipeline/textcat.py", line 373, in initialize
validate_get_examples(get_examples, "TextCategorizer.initialize")
File "spacy/training/example.pyx", line 64, in spacy.training.example.validate_get_examples
TypeError: [E930] Received invalid get_examples callback in `TextCategorizer.initialize`. Expected function that returns an iterable of Example objects but got: []
What could it be?
Out of curiosity, is there any reason in particular why you've opted for a random forest? Are you using count vectors with it in sklearn? If so, it might be worth trying out a simple logistic regression too; in my experience that performs very well.
It was mainly curiosity, as I already have an R script with an implementation of random forest over several stylometric features (if you're interested, check out Mikros & Perifanos, 2013). I'll check out logistic regression.
EDIT: Interestingly, the logistic regression performance over these 1378 annotations is 0.76, lower than the random forest with the AMNP procedure.
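For anyone curious, a CountVectorizer + LogisticRegression baseline in sklearn could look roughly like this (a minimal sketch reusing the dataset_2cat.json export and taking the first accepted label as the target; not the exact script behind the 0.76 figure):

import json

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Read back the two-category export created above; assumes one label per example
with open('dataset_2cat.json') as f:
    examples = json.load(f)
texts = [ex['text'] for ex in examples]
labels = [ex['accept'][0] for ex in examples]

# Count-vector features + logistic regression, scored with 5-fold cross-validation
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=5)
print(f"mean accuracy: {scores.mean():.2f}")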