textcat.teach splitting text stream

I’m using textcat.teach the following way, piping this script’s output into Prodigy via sys.stdin:

import json

# ElasticsearchIterable is the poster's own wrapper class, imported elsewhere
class TextIterable(ElasticsearchIterable):

    def __init__(self, query=None):
        super().__init__(query=query, elasticsearch_host='host', index='index', doc_type='doc_type', all_docs=True)

if __name__ == '__main__':
    texts = TextIterable(query={"query": {"query_string": {"default_field": "arquivo", "query": "query"}}})
    for text in texts:
        text_body = text['body']
        for paragraph in text_body.split('\n'):
            paragraph_json = {'text': paragraph}
            print(json.dumps(paragraph_json))
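For reference, the splitting logic in the loop above can be isolated into a small helper. This is a sketch, not the exact script: skipping blank lines is an addition here (the original emits every segment of the `\n` split, including empty ones):

```python
import json

def paragraphs_to_jsonl(body):
    """Split a document body on newlines and return one JSONL task
    per non-empty paragraph, mirroring the loop in the script above."""
    tasks = []
    for paragraph in body.split('\n'):
        paragraph = paragraph.strip()
        if paragraph:  # skip blank lines between paragraphs
            tasks.append(json.dumps({'text': paragraph}))
    return tasks

# Example:
tasks = paragraphs_to_jsonl("First paragraph.\n\nSecond paragraph.")
# tasks == ['{"text": "First paragraph."}', '{"text": "Second paragraph."}']
```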

My problem is that when I run this script, printing the texts I want to annotate in Prodigy, somewhere along the way the text is still being split by some unknown criteria, yielding only a few words at a time. I would like Prodigy to show exactly what the script emits, i.e. each paragraph extracted from the documents. I’m using the following command:

prodigy textcat.teach decisions /models/pt_glove_vectors --label DECISION --patterns decisions_patterns.jsonl
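For completeness, the full pipeline being described is the export script piped into that command (the script name here is a placeholder, not from the original post):

```shell
# Hypothetical invocation: "export_paragraphs.py" stands in for the
# Elasticsearch script shown above, which prints one JSON task per line.
python export_paragraphs.py | prodigy textcat.teach decisions /models/pt_glove_vectors --label DECISION --patterns decisions_patterns.jsonl
```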

How can I achieve this for the textcat.teach recipe?

Thanks!

Hmm, that’s strange! You verified that the output of your script generates the correct texts, right?

The textcat recipes shouldn’t split the text or the individual sentences at all, so this is very confusing. The ner recipes do, by default, to ensure better performance of the beam search algorithm, which needs to predict all possible parses – but you can also turn this off on the command line.

Do you have an example of the text and the output? You can also run the command with the environment variable PRODIGY_LOGGING=verbose, which will output all the data that’s passing through the app, so you can inspect it.

Hi @ines!

I tried running with a single document and it printed the paragraphs 100% correctly, so I think it must have been some other document that contains the same text but with different paragraph breaks. I’ll add some debug information to the JSON printed to the output in case I run into this again.
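One way to add that debug information is to attach a "meta" dict to each task; Prodigy displays "meta" values in the corner of the annotation card, so an unexpected split can be traced back to its source document. The field names below ("doc_id", "paragraph") are just an illustration, not from the original script:

```python
import json

def paragraph_tasks(doc_id, body):
    """Emit one task per paragraph, tagging each with its source
    document ID and paragraph index for easier debugging."""
    tasks = []
    for i, paragraph in enumerate(body.split('\n')):
        tasks.append({
            'text': paragraph,
            # "meta" is shown in the Prodigy annotation card
            'meta': {'doc_id': doc_id, 'paragraph': i},
        })
    return tasks

for task in paragraph_tasks('doc-1', 'First paragraph.\nSecond paragraph.'):
    print(json.dumps(task))
```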

Thanks!
