Getting io.TextIOWrapper instead of string when using --patterns

Hi, I created a custom Elasticsearch loader. When I use it with a single keyword like this:

prodigy elastic.textcat.teach my_dataset en_core_web_sm "bank account number" --label SENSITIVE

it works just fine. However, when I try to load a list of terms, it fails:

prodigy elastic.textcat.teach my_dataset en_core_web_sm --patterns terms/sensitive_terms.json --label SENSITIVE

The reason is that my loader function gets an io.TextIOWrapper instead of a str. Here is what the loader looks like:

def elastic_api_loader(source: str):
    es_config = get_config()['elastic_api']
    samples = es_config.get('samples', DEFAULT_TOP_N)

    handler = ElasticQueryHandler(samples)
    hits = handler.query(source)
    count = hits['total']

    for doc in hits['hits']:
        es_score = doc['_score']
        data = doc['_source']

        yield {
            'id': data.get('id'),
            'text': data.get('text'),
            'html': data.get('html'),
            'meta': {'query': source, 'es_score': es_score, 'doc_count': count},
        }

What is the proper way of handling patterns in the loader?

I think the problem in your command here is that you're not actually passing in a source argument (in the first one you do with "bank account number"). By default, if no source is specified, Prodigy will read from stdin, so you can pipe data forward to the recipe.

So basically, if you don't pass in a source, your recipe is waiting for something to be piped forward. In Python, standard input (sys.stdin) is an io.TextIOWrapper object, so that is what your loader receives instead of a string.
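If it helps with debugging, here is a minimal sketch of how a loader could tell the two cases apart. The helper name describe_source is made up for illustration and is not part of Prodigy; io.StringIO stands in for piped input.

```python
import io

def describe_source(source):
    """Report what kind of source a loader received (hypothetical helper)."""
    if isinstance(source, str):
        # A keyword or file path passed on the command line
        return "string"
    if isinstance(source, io.TextIOBase):
        # A text stream such as sys.stdin (io.TextIOWrapper is a subclass)
        return "text stream"
    return type(source).__name__

print(describe_source("bank account number"))
print(describe_source(io.StringIO("piped text")))
```

A loader could use a check like this to raise a clear error when no source was given, rather than failing later inside the query.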

Thank you. That’s indeed the issue, but it also means that I misunderstood the textcat.teach case.
What I’m trying to achieve is to mine documents from the search engine given a list of terms and assign a label. How can I do this if I have to provide one fixed source keyword?

You don't necessarily have to do that – it's just how it's done by default, because the source is normally a file path to the input text. Something needs to stream in data – either a file, your custom loader or a previous process that you pipe forward.

One very simple solution would be to have the elastic_api_loader take a file path, so you could pass in sensitive_terms.json instead of "bank account number". In your code, you could then do something like this:

import json

def elastic_api_loader(source: str):
    # source is now a path to a JSON file containing the search terms
    with open(source, encoding="utf8") as f:
        terms = json.load(f)
    # etc.
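To make the file-path idea a bit more concrete, here is a rough sketch of how the rest of the loader might loop over the terms. It assumes the JSON file is a flat list of strings, and it swaps your ElasticQueryHandler for a hypothetical fake_query stub so the example runs on its own. The names load_terms and stream_for_terms are also made up.

```python
import json
import os
import tempfile

def load_terms(path):
    """Read a JSON list of search terms from a file (assumed format)."""
    with open(path, encoding="utf8") as f:
        return json.load(f)

def stream_for_terms(terms, query):
    """Yield one task per hit for every term.

    `query` stands in for handler.query: it takes a term and returns a
    dict shaped like the hits used in the original loader.
    """
    for term in terms:
        hits = query(term)
        for doc in hits["hits"]:
            data = doc["_source"]
            yield {
                "text": data.get("text"),
                "meta": {
                    "query": term,
                    "es_score": doc["_score"],
                    "doc_count": hits["total"],
                },
            }

def fake_query(term):
    # Hypothetical stub replacing the real Elasticsearch call
    return {
        "total": 1,
        "hits": [{"_score": 1.0, "_source": {"text": f"text about {term}"}}],
    }

# Small demo: write a terms file, then build the stream from it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sensitive_terms.json")
    with open(path, "w", encoding="utf8") as f:
        json.dump(["bank account number", "passport"], f)
    tasks = list(stream_for_terms(load_terms(path), fake_query))

print(len(tasks))
```

Keeping the term that produced each hit in the meta makes it easier to see later which queries are pulling in useful examples.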

Alternatively, you could also create a separate loader script that takes care of putting together your stream based on your keywords and optional other parameters (see here for examples). If your script takes command-line arguments (using a library like plac or click), you could then do something like this:

python elastic_script.py sensitive_terms.json | prodigy textcat.teach my_dataset en_core_web_sm --label SENSITIVE
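As a sketch of what such a script could look like: the part that matters for piping is printing one JSON object per line to standard output. This assumes Prodigy accepts newline-delimited JSON on stdin. The function names are hypothetical, and tasks_for_terms is only a placeholder for the real Elasticsearch queries.

```python
import io
import json
import sys

def tasks_for_terms(terms):
    # Placeholder: the real script would query Elasticsearch per term
    for term in terms:
        yield {"text": f"example text for {term}", "meta": {"query": term}}

def write_jsonl(tasks, out=None):
    """Write each task as one JSON object per line and return the count."""
    out = out if out is not None else sys.stdout
    count = 0
    for task in tasks:
        out.write(json.dumps(task) + "\n")
        count += 1
    return count

def main(argv, out=None):
    """Entry point: argv[1] is the path to a JSON list of terms."""
    with open(argv[1], encoding="utf8") as f:
        terms = json.load(f)
    return write_jsonl(tasks_for_terms(terms), out)

# Demo with an in-memory buffer instead of real stdout
buffer = io.StringIO()
written = write_jsonl(tasks_for_terms(["bank account number"]), buffer)
print(written, buffer.getvalue().strip())
```

In the actual script you would call main(sys.argv) at the bottom instead of the demo, so the output goes to stdout and can be piped into the recipe.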

I’ll definitely try these approaches. I wonder whether there is any way to put a model in the loop to come up with better queries?

What exactly are you trying to do? Do you want to just find better terms to search for, like “bank account number”? If so, you might want to do this in a separate step, using word embeddings and something similar to terms.teach: https://prodi.gy/docs/recipes#terms-teach

For multi-word queries, something like sense2vec might be helpful: https://github.com/explosion/sense2vec