I am trying to convert an existing fastText-based classifier to spaCy, mostly because spaCy is a much easier library to ship/distribute, and it is well integrated with Prodigy (annotation matters, and needs to become part of a regular workflow).
I am seeing pretty low accuracy so far, and I am wondering if I missed something in the way spaCy and Prodigy work in tandem. Here is what I have tried so far:
- took a 60k-record "gold" dataset and imported it into a Prodigy dataset via db-in (and confirmed all 43 labels are present, and all examples are marked as "accept")
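For reference, this is roughly how I prepare the import: a small sketch that converts fastText-format lines (`__label__X some text`) into the JSONL task format that db-in expects. The label names and file name here are made up for illustration; the task keys (`text`, `label`, `answer`) are the standard Prodigy textcat fields.

```python
import json

def fasttext_to_prodigy(line):
    """Convert one fastText-format line ("__label__X some text")
    into a Prodigy textcat task dict suitable for db-in."""
    label, _, text = line.partition(" ")
    assert label.startswith("__label__"), "not a fastText-labelled line"
    return {
        "text": text.strip(),
        "label": label[len("__label__"):],
        "answer": "accept",  # gold examples are all accepted
    }

# Example: write a JSONL file that `prodigy db-in mydataset gold.jsonl` can load
lines = ["__label__BILLING I was charged twice this month"]
with open("gold.jsonl", "w", encoding="utf-8") as f:
    for line in lines:
        f.write(json.dumps(fasttext_to_prodigy(line)) + "\n")
```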
- automated an "active learner" that connects to Prodigy via the API, using the textcat.teach recipe, in order to generate a healthy amount of "reject" answers for the lowest-confidence examples that Prodigy selects. I stop this process when I have roughly a 50/50 split between accepted and rejected samples in my dataset, which amounts to a ~120k-record dataset
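To verify the 50/50 split, I export the dataset (e.g. `prodigy db-out mydataset > mydataset.jsonl`) and count the answers. A minimal stdlib sketch, with a two-task stand-in file in place of the real export:

```python
import json
from collections import Counter

def answer_counts(jsonl_path):
    """Count accept/reject answers in a Prodigy db-out export."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line).get("answer", "missing")] += 1
    return counts

# Tiny stand-in for `prodigy db-out mydataset > mydataset.jsonl`
sample = [
    {"text": "refund please", "label": "BILLING", "answer": "accept"},
    {"text": "refund please", "label": "SHIPPING", "answer": "reject"},
]
with open("mydataset.jsonl", "w", encoding="utf-8") as f:
    for task in sample:
        f.write(json.dumps(task) + "\n")

counts = answer_counts("mydataset.jsonl")
```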
- trained with prodigy textcat.batch-train mydataset --n-iter 10 --batch-size 1000 --dropout 0.2 -E (because it is a multi-class problem, not multi-label: the classes are mutually exclusive)
The current results are the following:
Using 20% of examples (24103) for evaluation
Using 100% of remaining examples (96414) for training
Dropout: 0.2  Batch size: 1000  Iterations: 20

#     LOSS    F-SCORE  ACCURACY
01    0.000   0.000    0.496
02    0.000   0.093    0.511
03    0.000   0.093    0.511
04    0.000   0.078    0.508
05    0.000   0.093    0.511
06    0.000   0.078    0.508
07    0.000   0.078    0.508
08    0.000   0.093    0.511
09    0.000   0.093    0.511
10    0.000   0.093    0.511
Now, I have tried with a smaller number of classes (2, 3, and 4) on a subset of the data, and it works much better (0.85-0.95 accuracy), but that is because the problem is obviously much easier. Is there anything else that other Prodigy/spaCy users have noticed when dealing with multi-class classification problems where the number of classes is not small (>20)? Should I look into a more custom approach in spaCy rather than leveraging the built-in recipes?
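One suspicion I have (happy to be corrected): with 43 mutually exclusive classes, a "reject" answer is far less informative than an "accept", since accepting identifies the class outright while rejecting only eliminates one of 43 candidates. A back-of-the-envelope sketch under a uniform prior:

```python
import math

n_classes = 43

# An "accept" pins down 1 class out of 43.
accept_bits = math.log2(n_classes)  # ~5.4 bits of information

# A "reject" only rules out 1 of 43, leaving 42 candidates.
reject_bits = math.log2(n_classes / (n_classes - 1))  # ~0.03 bits
```

If that intuition is right, a dataset that is 50% rejects would carry much less training signal per example than the gold set alone, which might partly explain the ~0.5 accuracy.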
Thank you in advance for your guidance!