This is our first time using Prodigy, so there are probably some naive questions below...
We want to train a model to classify texts into roughly 10 issue categories. We have 1,200 hand-coded documents (coded in an external application). We thought the basic setup should be something like:
- Import the 1200 manual codings.
- Check the classification performance with textcat.batch-train and/or textcat.train-curve.
- Assuming the model isn't good enough, annotate more data from our unannotated pool with textcat.teach.
- Repeat steps 2 and 3 until happy.
Question 1: Does this sound like the right way to use Prodigy?
Question 2: We tried doing steps 1 and 2, but the performance is immediately ~100%. It seems like the model is learning to predict accept/reject rather than our issue categories. See prodigy.sh · GitHub for the code we used. Are we using the wrong commands?
Question 3 (probably related to 2): we are inputting the data like so:
{"text":"Podcast van 28 februari[...] vermissing, VVD","label":"wonen","meta": {"":"1","id":"188159157","medium":"1almere"}}
Is that the correct format, given that the target label is 'wonen'?
(never mind the silly "": "1" field; those are R row names, but they don't seem to cause the problem)
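In case it's useful, here is a small Python sketch of how we could generate per-label examples instead of one row per document, creating an "accept" for the gold label and a "reject" for every other label, so the model actually sees both answers. The "answer" key and this one-example-per-label layout are our guess at what textcat.batch-train expects, not something we've confirmed:

```python
import json

# Our ~10 issue categories, abbreviated here to three for the example
LABELS = ["wonen", "zorg", "onderwijs"]

def make_examples(row):
    """Turn one hand-coded row into per-label accept/reject examples.
    The 'answer' key is our guess at what textcat.batch-train expects."""
    examples = []
    for label in LABELS:
        examples.append({
            "text": row["text"],
            "label": label,
            "answer": "accept" if label == row["label"] else "reject",
            "meta": {"id": row["meta"]["id"], "medium": row["meta"]["medium"]},
        })
    return examples

row = {"text": "Podcast van 28 februari ...", "label": "wonen",
       "meta": {"id": "188159157", "medium": "1almere"}}

# One JSON object per line, ready for prodigy db-in
with open("annotations.jsonl", "w", encoding="utf-8") as f:
    for ex in make_examples(row):
        f.write(json.dumps(ex) + "\n")
```

Is something like this necessary, or does batch-train derive the rejects itself from the single gold label?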
Question 4: We also have a dictionary of terms for each issue, and a structural topic model whose topics correspond (somewhat) to the target issues. Does it make sense to feed these into the initial model somehow, and how would we do that?
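For the term dictionaries, our best guess is to turn them into a patterns file for textcat.teach --patterns. Here is a sketch of how we'd build one, assuming a one-pattern-per-line JSONL format with "label" and a token-level "pattern" (the terms below are made-up stand-ins for our real, much longer lists):

```python
import json

# Per-issue term dictionaries (abbreviated; ours are much longer)
ISSUE_TERMS = {
    "wonen": ["huur", "woningbouw", "hypotheek"],
    "zorg": ["ziekenhuis", "huisarts"],
}

def build_patterns(issue_terms):
    """One token-match pattern per term, labelled with its issue.
    Multi-word terms become one token dict per word."""
    patterns = []
    for label, terms in issue_terms.items():
        for term in terms:
            patterns.append({"label": label,
                             "pattern": [{"lower": w} for w in term.split()]})
    return patterns

with open("issue_patterns.jsonl", "w", encoding="utf-8") as f:
    for p in build_patterns(ISSUE_TERMS):
        f.write(json.dumps(p) + "\n")
```

Would that be the right way to use the dictionaries, and is there any comparable way to bring in the topic model?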
Question 5: Do we need to specify the spaCy model ("nl"?) and/or indicate which features we think are informative?