Getting io.TextIOWrapper instead of string when using --patterns

nikolay · October 17, 2018, 2:17pm

Hi, I created a custom Elasticsearch loader. When I use it with a single keyword like this:

prodigy elastic.textcat.teach my_dataset en_core_web_sm "bank account number" --label SENSITIVE

it works just fine. However, when I try to load a list of terms, it fails:

prodigy elastic.textcat.teach my_dataset en_core_web_sm --patterns terms/sensitive_terms.json --label SENSITIVE

The reason is that my loader function gets an io.TextIOWrapper instead of a str. Here is what the loader looks like:

def elastic_api_loader(source: str):
    es_config = get_config()['elastic_api']
    samples = es_config.get('samples', DEFAULT_TOP_N)

    handler = ElasticQueryHandler(samples)
    hits = handler.query(source)
    count = hits['total']

    for doc in hits['hits']:
        es_score = doc['_score']
        data = doc['_source']

        yield {
            'id': data.get('id'),
            'text': data.get('text'),
            'html': data.get('html'),
            'meta': {'query': source, 'es_score': es_score, 'doc_count': count},
        }

What is the proper way of handling patterns in the loader?

ines · October 18, 2018, 4:58pm

I think the problem in your command here is that you're not actually passing in a source argument (in the first one you do with "bank account number"). By default, if no source is specified, Prodigy will read from stdin, so you can pipe data forward to the recipe.

So basically, if you don't pass in a source, your recipe is waiting for something to be piped forward, and the text it reads from standard input is represented by io.TextIOWrapper.
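
To make the difference concrete, here is a minimal sketch of a helper that accepts either kind of source. The name normalize_source is hypothetical and not part of Prodigy; it only illustrates that a keyword arrives as a str, while piped input arrives as a text stream such as io.TextIOWrapper.

```python
import io


def normalize_source(source):
    """Return the text behind `source`.

    Hypothetical helper, not a Prodigy API: a str is returned unchanged,
    and a text stream (for example io.TextIOWrapper wrapping stdin) is
    read to the end.
    """
    if isinstance(source, str):
        return source
    if isinstance(source, io.TextIOBase):
        # io.TextIOWrapper and io.StringIO both derive from io.TextIOBase.
        return source.read()
    raise TypeError(f"Unsupported source type: {type(source).__name__}")
```

A loader could call this first and fail with a clear error instead of passing a stream object on to the query handler.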

nikolay · October 18, 2018, 6:19pm

Thank you. That’s indeed the issue, but it also means that I misunderstood the textcat.teach case.
What I’m trying to achieve is to mine documents from the search engine given a list of terms and assign a label. How can I do this if I have to provide one fixed source keyword?

ines · October 18, 2018, 7:08pm

You don't necessarily have to do that – it's just how it's done by default, because the source is normally a file path to the input text. Something needs to stream in data – either a file, your custom loader or a previous process that you pipe forward.

One very simple solution would be to have the elastic_api_loader take a file path, so you could pass in sensitive_terms.json instead of "bank account number". In your code, you could then do something like this:

import json

def elastic_api_loader(source: str):
    # source is now a path to a JSON file of terms, not a single keyword
    with open(source) as f:
        terms = json.load(f)
    # etc.
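
Filling in the "etc." part, a rough sketch could look like the following. It assumes the file holds a JSON array of strings, and that the query function returns a dict shaped like the hits in the loader above ('hits', '_score', '_source'). The names load_terms, stream_for_terms and query_fn are made up for illustration; query_fn stands in for something like handler.query.

```python
import json


def load_terms(path):
    """Read the search terms, assuming the file is a JSON array of strings."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def stream_for_terms(terms, query_fn):
    """Yield one task per hit for each term.

    `query_fn` is a stand-in for the real query call and is assumed to
    return a dict with 'total' and 'hits' keys, as in the loader above.
    """
    for term in terms:
        hits = query_fn(term)
        for doc in hits["hits"]:
            data = doc["_source"]
            yield {
                "id": data.get("id"),
                "text": data.get("text"),
                "meta": {
                    "query": term,
                    "es_score": doc["_score"],
                    "doc_count": hits.get("total"),
                },
            }
```

Because the term is stored in each task's meta, you can still see which query produced a given example.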

Alternatively, you could also create a separate loader script that takes care of putting together your stream based on your keywords and optional other parameters (see here for examples). If your script takes command-line arguments (using a library like plac or click), you could then do something like this:

python elastic_script.py sensitive_terms.json | prodigy textcat.teach my_dataset en_core_web_sm --label SENSITIVE
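
The output side of such a script could be sketched like this, assuming the recipe reads newline-delimited JSON from standard input, one task per line. The function names to_jsonl and write_tasks are hypothetical, and argument parsing is left out.

```python
import json
import sys


def to_jsonl(tasks):
    """Serialize each task dict as one line of JSON."""
    for task in tasks:
        yield json.dumps(task)


def write_tasks(tasks, out=None):
    """Write tasks as newline-delimited JSON, to stdout unless `out` is given."""
    out = sys.stdout if out is None else out
    for line in to_jsonl(tasks):
        out.write(line + "\n")


# In elastic_script.py you would build the tasks from your terms and then call:
# write_tasks(tasks)
```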

nikolay · October 18, 2018, 9:03pm

I’ll definitely try these approaches. I wonder if there is any way to put a model in the loop to come up with better queries?

ines · October 19, 2018, 9:31am

What exactly are you trying to do? Do you want to just find better terms to search for, like “bank account number”? If so, you might want to do this in a separate step, using word embeddings and something similar to terms.teach: https://prodi.gy/docs/recipes#terms-teach

For multi-word queries, something like sense2vec might be helpful: https://github.com/explosion/sense2vec
