Trouble creating an evaluation set with textcat.eval

I’m trying to build a text classifier with multiple labels, e.g. ‘wonen’ (housing) and ‘economie’ (economy). We have manually annotated data from someone else that we trust reasonably well, which I used to create an initial spaCy model.

What works really well and efficiently is using textcat.teach to create examples, working one label at a time. The input is raw text, e.g. an NDJSON/JSONL file with lines containing {"text": "...", "meta": {"id": "..."}}, and after running it for multiple labels, db-out lists the same text once per label with the associated accept/reject decisions.
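For reference, the per-label sessions look roughly like this (the dataset and model names here are just placeholders):

prodigy textcat.teach news_topics my_initial_model news.jsonl --label wonen
prodigy textcat.teach news_topics my_initial_model news.jsonl --label economie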

Now, we want to create a new set for evaluation. We want this to be totally independent of the created model, since just accepting/rejecting model suggestions would bias the evaluation set towards the model output. I would have expected to be able to use textcat.eval for this, but for some reason it requires the input data to already have an associated label, whether I use the pre-trained model or an empty model (nl_core_news_sm).

If I add a label (e.g. “economie”) to the data file and then call textcat.eval, it works fine; but of course this only codes a single label, which means I would need to create a separate input file for each label. I do get an error message on exiting the server (Buffer and memoryview are not contiguous in the same dimension.), but the data seem to be stored properly.
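For illustration, each input line then looks roughly like this (the values are placeholders):

{"text": "...", "label": "economie", "meta": {"id": "..."}}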

If I create an input file with multiple labels on different lines and specify that I only want to code e.g. economie with the -l economie option, I still get the ‘wonen’ example presented as well. Moreover, on quitting the server I get an error message (KeyError: 'wonen'), and the example with wonen is stored as a decision on wonen, seemingly ignoring the -l option.

See https://gist.github.com/vanatteveldt/6af3b7ef8a7d22f87ddda828e162fa81 for a session log with the full commands and stack traces.

Questions:

  1. Is textcat.eval a good way to create a “gold standard” independent of the existing models? Topic #533 (which I’m not allowed to link here?) seems to suggest it might not be, but if I run it without any existing model, it should give an unbiased evaluation set, right? Is there a better way to code a gold-standard set with Prodigy?

  2. Is it a bug that textcat.eval requires a label property in the JSON even if --label is specified, and then ignores the option anyway?

  3. Should I create a separate input file for each label, differing only in the label property? Or is there a better way to do this?

Thanks again!
– Wouter

Hi, and sorry about the per-post link limit! (It's mostly for spam bot protection, and since we've already had issues with spam bots before, I'm scared to turn off this setting :stuck_out_tongue_winking_eye:)

textcat.eval is mostly intended as a quick tool for performing live evaluations of existing models. It lets you answer questions like "How would my model perform on this new data?" without having to create a full evaluation set first and then run a separate evaluation. Instead, the recipe uses the model, runs it over the new text and asks you whether the predictions are correct. This way, Prodigy can immediately show you the results when you exit the session. Here's the link to the thread you mentioned, which explains this in more detail:

That said, the recipe is really meant for evaluations you do during development. Once you're ready to perform a full, repeatable, gold-standard evaluation, you usually want to create a new set manually. The nice thing is that if you use Prodigy's textcat.batch-train recipe, you can evaluate from the same binary annotation style – that is, a set of examples with "accept" and "reject" annotations.
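Once the evaluation set exists, training and evaluating could then look something like this (the dataset and output names are placeholders, and the exact arguments may differ slightly between versions):

prodigy textcat.batch-train your_training_set nl_core_news_sm --eval-id your_eval_set --output /path/to/model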

I'd suggest starting with the mark recipe, which takes a stream, an optional label and the name of an annotation interface, and will show you whatever comes in, in exactly that order. So you can do something like the following and pass in your label wonen to annotate whether the label applies to the text or not:

prodigy mark your_eval_set your_data.jsonl --label wonen --view-id classification

For each label you want to annotate, you can then start a new session over the same data and add the annotations to your evaluation set. It might sound unintuitive at first, but we've found that it's often faster and more efficient to make several passes over the data and annotate it once for each label. Your brain gets to focus on one label and concept at a time, you won't have to click as much (because you're only saying yes or no), and you'll end up with one binary decision for each label on each text, which you can later evaluate your model against.
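So the second pass would look just like the first, only with a different label, e.g.:

prodigy mark your_eval_set your_data.jsonl --label economie --view-id classification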

(If you do want to solve the gold-standard annotation differently and do it all in one pass – for example, if you have too many labels – you could also create a custom evaluation recipe using the choice interface. The selected labels will then be stored like "accept": ["wonen"], so if you want to run your evaluation within Prodigy, you'll have to convert the data to the "label": "wonen" format. You can find an example of the recipe and workflow in the "Quickstart" section at the bottom of this page.)
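Just as a rough illustration, that conversion could look something like the sketch below. The file names and the LABELS list are placeholders, so adjust them to your data; the converted file could then be imported into a new dataset (e.g. via db-in) and passed as --eval-id.

import json

LABELS = ["wonen", "economie"]  # placeholder: the full set of labels you annotated

with open("eval_choice.jsonl", encoding="utf8") as f_in, \
        open("eval_binary.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        task = json.loads(line)
        if task.get("answer") != "accept":
            continue  # skip rejected or ignored tasks
        selected = set(task.get("accept", []))  # options picked in the choice UI
        for label in LABELS:
            # one binary accept/reject example per label and text
            example = {
                "text": task["text"],
                "label": label,
                "answer": "accept" if label in selected else "reject",
            }
            if "meta" in task:
                example["meta"] = task["meta"]
            f_out.write(json.dumps(example) + "\n")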

Quick note, also in case others come across this thread later: It's important to keep in mind that spaCy's text classifier assumes that the categories are not mutually exclusive, so this will also be the basis of the built-in evaluation when you pass an --eval-id dataset to textcat.batch-train.

Let me have a look at this! The --label option should specify the category you want to annotate, and only predictions for that category should be shown. Prodigy should add the label to the examples as they come in, so you shouldn't have to add any labels to the data in advance.

Excellent, thanks for the quick reply! We’ll just use mark in that case. And we agree with the one-label-per-pass strategy; I think it makes a lot of sense.
