textcat.eval throws UnicodeError on Windows

I am running Prodigy 1.5.1 on Windows 10 with Python 3.5.5 in a conda environment. A custom choice recipe, as well as textcat.teach, textcat.batch-train and the like, works perfectly. However, trying to run textcat.eval throws this error:

11:20:02 - RECIPE: Calling recipe 'textcat.eval'
Using 7 labels: Building scale locally, Building scale internationally, Acquiring digital capabilities, New business model, Add new services, Not applicable, Other reasons(Residual)
11:20:02 - RECIPE: Starting recipe textcat.eval
11:21:07 - LOADER: Using file extension 'jsonl' to find loader
11:21:07 - LOADER: Loading stream from jsonl
11:21:07 - RECIPE: Initialised TextClassifier with model model
11:21:07 - CONTROLLER: Initialising from recipe
11:21:07 - VALIDATE: Creating validator for view ID 'classification'
11:21:07 - DB: Initialising database SQLite
11:21:07 - DB: Connecting to database SQLite
11:21:07 - DB: Loading dataset 'taught' (9 examples)
11:21:07 - DB: Creating dataset '2018-09-12_11-21-07'
11:21:07 - CONTROLLER: Validating the first batch
11:21:07 - CONTROLLER: Iterating over stream
Traceback (most recent call last):
  File "C:\Users\henning.lebbaeus\AppData\Local\Continuum\Miniconda3\envs\prodigy\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
  File "C:\Users\henning.lebbaeus\AppData\Local\Continuum\Miniconda3\envs\prodigy\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
  File "C:\Users\henning.lebbaeus\AppData\Local\Continuum\Miniconda3\envs\prodigy\lib\site-packages\prodigy\__main__.py", line 259, in <module>
controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 178, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src\prodigy\core.pyx", line 55, in prodigy.core.Controller.__init__
  File "C:\Users\henning.lebbaeus\AppData\Local\Continuum\Miniconda3\envs\prodigy\lib\site-packages\toolz\itertoolz.py", line 368, in first
return next(iter(seq))
  File "cython_src\prodigy\core.pyx", line 97, in iter_tasks
  File "cython_src\prodigy\components\validate.pyx", line 72, in prodigy.components.validate.Validator.check
  File "cython_src\prodigy\util.pyx", line 612, in prodigy.util.prints
  File "cython_src\prodigy\util.pyx", line 643, in prodigy.util._locale_escape
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 824: invalid start byte

Loading textcat.teach with the same input file works. Any ideas what’s going on here?

Hmm, at first glance, it looks like the error is caused by the output of the validator – so, the JSON schema validator wants to output a validation error (because something in the first batch of the incoming stream doesn’t validate), but something within that error can’t be decoded properly in the print/locale helper :thinking:

Could you share more details on the command you’re running and how you’ve integrated the custom choice recipe? And is there anything in the first batch of your data (by default, 10 examples) that looks suspicious?

The choice recipe just adds the options for a multiple-choice labelling task (as described here). We used it to bootstrap the annotations. As I said, the same JSONL file works perfectly with textcat.teach.
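Roughly, the recipe wraps the stream like this (a sketch, not our exact recipe; the label set is shortened and the option format follows the "choice" interface convention):

```python
LABELS = ["Building scale locally", "Building scale internationally",
          "Acquiring digital capabilities"]  # shortened for the example

def add_options(stream, labels):
    """Attach one multiple-choice option per label to each incoming task."""
    options = [{"id": label, "text": label} for label in labels]
    for eg in stream:
        eg["options"] = options
        yield eg

# Example: wrap a one-task stream
stream = add_options(iter([{"text": "Some deal description."}]), LABELS)
```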

Here is the command I use to run textcat.eval:
python -m prodigy textcat.eval taught model test.jsonl -l "Building scale locally","Building scale internationally","Acquiring digital capabilities","New business model","Add new services","Not applicable","Other reasons(Residual)"

model is a model saved from training with textcat.batch-train.

Sample from input file:
{"meta": {"deal_ID": 633861}, "text": "EIB Capital Crop., the US-based holding company of Mr. Sai Yung Chung, a China-based individual having interest in banking sector, has agreed to acquire Eastern International Bank (Eastern), the US-based bank, for an approximate cash consideration of USD 31.9m.\nUnder the terms of the agreement, Eastern shareholders will receive approximately USD 60.23, subject to downward adjustment based on the performance of Eastern, in exchange for each share of Eastern common stock. Post acquisition, Eastern will be wholly-owned by Mr. Chung.\nEastern reported consolidated assets of approximately USD 119.47m and had loans of USD 78.378m as 31 December 2015. The transaction is subject to Eastern’s regulators and shareholders\u2019 approval and is expected to be completed in the fall of 2016."}

Thanks! This is really mysterious, because the textcat.eval recipe doesn’t do anything special and is very similar to the other recipes. And once the recipe is executed, it triggers the same processes as textcat.teach.

Two ideas to help with debugging this:

  • Does the same thing happen when you set "validate": false in your prodigy.json?
  • This is only speculative, but I just had a look at the recipe source and compared textcat.eval and textcat.teach to see if there were any differences in the way the stream was orchestrated. The only one I found was:
# textcat.eval
stream = get_stream(source, api, loader)
# textcat.teach
stream = get_stream(source, api, loader, rehash=True, dedup=True,
                    input_key='text')

If you change this in your recipes/textcat.py, does that make a difference?
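For context, the rehash=True, dedup=True arguments make the stream drop tasks whose input text has already been seen. A toy version of that filtering (not Prodigy's actual hashing, just an illustration of the behaviour):

```python
import hashlib

def dedup_by_text(stream):
    """Mimic dedup=True with input_key='text': skip tasks whose text
    hashes to a value we've already seen earlier in the stream."""
    seen = set()
    for eg in stream:
        key = hashlib.md5(eg["text"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield eg

examples = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
unique = list(dedup_by_text(iter(examples)))  # the duplicate "a" is dropped
```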

Well, the server at least starts and the app renders; however, the labels are missing. E.g. I see a blank blue box above the text where I would expect the label to be (similar to textcat.teach, right?).

This did not make a difference.

I think I found the source of the UnicodeError. It has to do with this. Maybe it did not pop up in textcat.teach because teach uses a sorter, so by chance it had not stumbled upon these cases yet?
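For reference, byte 0x92 from the traceback is the Windows-1252 "smart" right single quote, which is not valid UTF-8. A minimal reproduction (the byte sequence here is illustrative, not taken from the actual file):

```python
# 0x92 is U+2019 (right single quote) in Windows-1252, but as a lone
# continuation byte it is invalid UTF-8 -- hence "invalid start byte".
raw = b"Eastern\x92s regulators"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0x92 ...

# Decoding with the right codec recovers the curly apostrophe
print(raw.decode("cp1252"))
```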

But then I received a new error:

python -m prodigy textcat.eval taught model test.jsonl -l "Building scale locally","Building scale internationally","Acquiring digital capabilities","New business model","Add new services","Not applicable","Other reasons(Residual)"
    14:01:02 - RECIPE: Calling recipe 'textcat.eval'
    Using 7 labels: Building scale locally, Building scale internationally, Acquiring digital capabilities, New business model, Add new services, Not applicable, Other reasons(Residual)
    14:01:02 - RECIPE: Starting recipe textcat.eval
    14:02:11 - LOADER: Using file extension 'jsonl' to find loader
    14:02:11 - LOADER: Loading stream from jsonl
    14:02:11 - LOADER: Rehashing stream
    14:02:11 - RECIPE: Initialised TextClassifier with model model
    14:02:11 - CONTROLLER: Initialising from recipe
    14:02:11 - VALIDATE: Creating validator for view ID 'classification'
    14:02:11 - DB: Initialising database SQLite
    14:02:11 - DB: Connecting to database SQLite
    14:02:11 - DB: Loading dataset 'taught' (9 examples)
    14:02:11 - DB: Creating dataset '2018-09-12_14-02-11'
    14:02:11 - CONTROLLER: Validating the first batch
    14:02:11 - CONTROLLER: Iterating over stream
    14:02:11 - FILTER: Filtering duplicates from stream
    14:02:11 - FILTER: Filtering out empty examples for key 'text'

    ?  ERROR: Invalid task format for view ID 'classification'
    'label' is a required property

    {'text': "EIB Capital Crop., the US-based holding company of Mr. Sai Yung Chung, a China-based individual having interest in banking sector, has agreed to acquire Eastern International Bank (Eastern), the US-based bank, for an approximate cash consideration of USD 31.9m.\nUnder the terms of the agreement, Eastern shareholders will receive approximately USD 60.23, subject to downward adjustment based on the performance of Eastern, in exchange for each share of Eastern common stock. Post acquisition, Eastern will be wholly-owned by Mr. Chung.\nEastern reported consolidated assets of approximately USD 119.47m and had loans of USD 78.378m as 31 December 2015. The transaction is subject to Eastern's regulators and shareholders' approval and is expected to be completed in the fall of 2016.", 'meta': {'deal_ID': 633861}, '_input_hash': -400490899, '_task_hash': 1343939113}

Looking at the source code in textcat.py, the teach recipe contains this line:

# textcat.teach
stream = prefer_uncertain(predict(stream))

The eval recipe does not contain anything like that. Is it missing the call to predict(stream) (without sorting)?
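To illustrate what that line does: predict scores each example with the model, and prefer_uncertain then prioritises the examples the model is least sure about. A toy version of the sorter (not Prodigy's actual implementation, which works on an infinite stream with moving averages):

```python
def prefer_uncertain_sketch(scored_stream):
    """Yield examples whose score is closest to 0.5 first, i.e. the ones
    the model is most uncertain about."""
    batch = list(scored_stream)
    batch.sort(key=lambda pair: abs(pair[0] - 0.5))
    for score, eg in batch:
        yield eg

# Example: the 0.52-scored task comes out first
stream = prefer_uncertain_sketch(iter([(0.9, {"text": "a"}),
                                       (0.52, {"text": "b"}),
                                       (0.1, {"text": "c"})]))
```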

Ah, thanks for getting to the bottom of this – looks like we need a better way of dealing with those unicode specifics in our error messages then! And it seems that the labels were definitely the problem, and what the validator was trying to complain about was that the data didn’t have any labels attached.

I just had a look at the textcat.eval recipe and the way it works is currently a little unintuitive and not 100% consistent with ner.eval. Instead of asking the model for its predictions and adding the labels, it currently expects the data to already have labels assigned and will then compare the annotations with the model’s predictions later.

I suspect the reason that recipe ended up this way is because it’s more difficult for text classification to select the examples to ask the user about: Should you just take the ones with the highest scores? All examples for all labels? A mix? For NER, that was an easier decision, because the recipe could simply show the user all entities in doc.ents.

We should definitely make this consistent for the next release, though. From the perspective of your use case, how would you expect the selection to work? Would you expect to see all examples with all labels, or maybe even a random selection (labels with high scores and labels with low scores etc.)?
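If “all examples with all labels” were the chosen strategy, the expansion could look something like this (a sketch, not the recipe's actual behaviour):

```python
import copy

def expand_with_labels(stream, labels):
    """Create one binary accept/reject task per (example, label) pair."""
    for eg in stream:
        for label in labels:
            task = copy.deepcopy(eg)  # copy so tasks don't share state
            task["label"] = label
            yield task

# Example: one input task becomes one task per label
tasks = list(expand_with_labels(iter([{"text": "x"}]), ["A", "B"]))
```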

In the meantime, you could add a function like this that wraps your stream and adds labels:

from prodigy.util import set_hashes

def add_labels_to_stream(stream):
    for eg in stream:
        # add your label here, process the eg['text'] with the 
        # model if necessary etc.
        eg['label'] = 'xxx'
        eg = set_hashes(eg)  # rehash, just in case
        yield eg

stream = add_labels_to_stream(stream)

I think I finally understand now. What I was expecting from textcat.eval is the behaviour of textcat.teach, just without model updates and sorting. It was just not clear that the labels should be provided by the user. Given that, I think there isn’t really any need to load the model in textcat.eval. Or am I missing something?

How to sort the examples shown depends heavily on the strategy for building a good evaluation set, which in turn depends on the problem, so it makes perfect sense to leave this decision to the user.

Anyway, may I suggest updating the documentation to reflect this?

Thanks a lot for your extremely fast support! :+1: