I’m trying to build a text classifier with multiple labels, e.g. ‘wonen’ (housing) and ‘economie’. We have manually coded data from someone else that we trust reasonably well, which I used to create an initial spaCy model.
What works really well and efficiently is using `textcat.teach` to create examples, working one label at a time. The input is raw text, e.g. an ndjson/jsonl file with lines containing `{"text": "...", "meta": {"id": "..."}}`, and after running it for multiple labels, `db-out` lists the same text for each label with the associated accept/reject decisions.
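For reference, the commands look roughly like this (the dataset, model, and file names are placeholders):

```
# annotate one label at a time against the raw-text source
prodigy textcat.teach news_nl my_initial_model articles.jsonl --label wonen
prodigy textcat.teach news_nl my_initial_model articles.jsonl --label economie

# export all decisions; each text appears once per label it was coded for
prodigy db-out news_nl > news_nl_annotations.jsonl
```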
Now, we want to create a new set for evaluation. We want this to be totally independent of the created model, as just accepting/rejecting model suggestions would bias the evaluation set towards the model output. I would have expected to be able to use `textcat-eval` for this, but for some reason it requires the input data to already have an associated label, whether I use the pre-trained model or an empty model (`nl_core_news_sm`).
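Concretely, this is roughly the invocation I expected to work on the raw, unlabelled input (the dataset name is a placeholder; the exact commands and stack traces are in the gist linked below):

```
prodigy textcat-eval eval_set nl_core_news_sm articles.jsonl --label economie
```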
If I add a label (e.g. “economie”) to the data file and then call `textcat-eval`, it works fine; but of course this only codes a single label, which means I would need to create a separate input file for each label. I do get an error message on exiting the server (`Buffer and memoryview are not contiguous in the same dimension.`), but the data seem to be stored properly.
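Adding the label is easy enough with a small script; a sketch of what I’m doing (file names are placeholders):

```python
import json

# add a fixed "label" field to every task so textcat-eval accepts the file
with open("articles.jsonl", encoding="utf-8") as f_in, \
        open("articles_economie.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        task = json.loads(line)
        task["label"] = "economie"  # would need one output file per label
        f_out.write(json.dumps(task) + "\n")
```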
If I create an input file with multiple labels on different lines and specify that I only want to code e.g. economie with the `-l economie` option, I still get the ‘wonen’ example presented as well. Moreover, on quitting the server I get an error message (`KeyError: 'wonen'`), and the example with `wonen` is stored as a decision on `wonen`, seemingly ignoring the `-l` option.
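The multi-label input file looks roughly like this (texts abbreviated, ids just for illustration):

```
{"text": "...", "label": "economie", "meta": {"id": "1"}}
{"text": "...", "label": "wonen", "meta": {"id": "2"}}
```

With `-l economie` I would expect only the first line to be queued, but both are presented and stored.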
See https://gist.github.com/vanatteveldt/6af3b7ef8a7d22f87ddda828e162fa81 for a session log with the full commands and stack traces.
Questions:

- Is `textcat-eval` a good way to create a “gold standard” independent of the existing models? Topic #533 (which I’m not allowed to link here?) seems to suggest it might not be, but if I run it without any existing model it should give an unbiased evaluation set, right? Is there a better way to code a gold standard set with Prodigy?
- Is it a bug that `textcat-eval` requires a `label` property in the JSON even if `--label` is specified, and ignores the option anyway?
- Should I create a separate input file for each label, differing only in the `label` property? Or is there a better way to do this?
Thanks again!
– Wouter