textcat.teach splitting text stream

I’m using textcat.teach the following way, piping this script’s output into Prodigy via sys.stdin:

import json

# ElasticsearchIterable is the poster's own wrapper class, imported elsewhere
class TextIterable(ElasticsearchIterable):

    def __init__(self, query=None):
        super().__init__(query=query, elasticsearch_host='host', index='index', doc_type='doc_type', all_docs=True)

if __name__ == '__main__':
    texts = TextIterable(query={"query": {"query_string": {"default_field": "arquivo", "query": "query"}}})
    for text in texts:
        text_body = text['body']
        for paragraph in text_body.split('\n'):
            paragraph_json = {'text': paragraph}
            print(json.dumps(paragraph_json))
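For reference, the splitting logic in the loop above can be isolated into a small helper. This is a sketch, not the exact script: skipping blank lines is an addition here (the original emits every segment of the `\n` split, including empty ones):

```python
import json

def paragraphs_to_jsonl(body):
    """Split a document body on newlines and return one JSONL task
    per non-empty paragraph, mirroring the loop in the script above."""
    tasks = []
    for paragraph in body.split('\n'):
        paragraph = paragraph.strip()
        if paragraph:  # skip blank lines between paragraphs
            tasks.append(json.dumps({'text': paragraph}))
    return tasks

# Example:
tasks = paragraphs_to_jsonl("First paragraph.\n\nSecond paragraph.")
# tasks == ['{"text": "First paragraph."}', '{"text": "Second paragraph."}']
```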

My problem is that when I run this script, printing the texts I want to annotate in Prodigy, somewhere along the way the text is still being split by some unknown criteria, yielding only a few words at a time. I would like Prodigy to show exactly what the script emits, i.e. each paragraph extracted from the documents. I’m using the following command:

prodigy textcat.teach decisions /models/pt_glove_vectors --label DECISION --patterns decisions_patterns.jsonl
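For completeness, the full pipeline being described is the export script piped into that command (the script name here is a placeholder, not from the original post):

```shell
# Hypothetical invocation: "export_paragraphs.py" stands in for the
# Elasticsearch script shown above, which prints one JSON task per line.
python export_paragraphs.py | prodigy textcat.teach decisions /models/pt_glove_vectors --label DECISION --patterns decisions_patterns.jsonl
```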

How can I achieve this for the textcat.teach recipe?

Thanks!

Hmm, that’s strange! You verified that the output of your script generates the correct texts, right?

The textcat recipes shouldn’t split the text or the individual sentences at all, so this is very confusing. The ner recipes do, by default, to ensure better performance of the beam search algorithm, which needs to predict all possible parses – but you can also turn this off on the command line.

Do you have an example of the text and the output? You can also run the command with the environment variable PRODIGY_LOGGING=verbose, which will output all the data that’s passing through the app, so you can inspect it.

Hi @ines!

I tried running with a single document and it printed the paragraphs 100% correctly, so I think it must have been some other document that contains the same text but with different paragraph breaks. I’ll add some debug information to the JSON printed to the output in case I run into this again.
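One way to add that debug information is to attach a "meta" dict to each task; Prodigy displays "meta" values in the corner of the annotation card, so an unexpected split can be traced back to its source document. The field names below ("doc_id", "paragraph") are just an illustration, not from the original script:

```python
import json

def paragraph_tasks(doc_id, body):
    """Emit one task per paragraph, tagging each with its source
    document ID and paragraph index for easier debugging."""
    tasks = []
    for i, paragraph in enumerate(body.split('\n')):
        tasks.append({
            'text': paragraph,
            # "meta" is shown in the Prodigy annotation card
            'meta': {'doc_id': doc_id, 'paragraph': i},
        })
    return tasks

for task in paragraph_tasks('doc-1', 'First paragraph.\nSecond paragraph.'):
    print(json.dumps(task))
```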

Thanks!
