I’m using textcat.teach as follows, piping data in via sys.stdin:
```python
import json

class TextIterable(ElasticsearchIterable):
    def __init__(self, query=None):
        super().__init__(query=query, elasticsearch_host='host',
                         index='index', doc_type='doc_type', all_docs=True)

if __name__ == '__main__':
    texts = TextIterable(query={"query": {"query_string": {"default_field": "arquivo", "query": "query"}}})
    for text in texts:
        text_body = text['body']
        # Emit one JSON task per paragraph
        for paragraph in text_body.split('\n'):
            paragraph_json = {'text': paragraph}
            print(json.dumps(paragraph_json))
```
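To illustrate what the script above emits on stdout, here is a minimal, self-contained sketch of the paragraph-splitting step alone (the document body is a made-up sample; the real text comes from the Elasticsearch `body` field):

```python
import json

# Hypothetical document body standing in for text['body'] above
sample_body = "First paragraph of the decision.\nSecond paragraph of the decision."

# One JSONL task per paragraph, exactly as the script prints them
lines = [json.dumps({'text': paragraph}) for paragraph in sample_body.split('\n')]

for line in lines:
    print(line)
```

Each printed line is a complete task like `{"text": "First paragraph of the decision."}`, which is the format I expect Prodigy to present for annotation.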
My problem is that when I run this script and pipe its output into Prodigy for annotation, somewhere along the way the text gets split further by some unknown criteria, so I only see a few words at a time. I would like to see exactly what the script emits: one task per paragraph extracted from the documents. I’m using the following command:
```
prodigy textcat.teach decisions /models/pt_glove_vectors --label DECISION --patterns decisions_patterns.jsonl
```
How can I make the textcat.teach recipe show the full paragraphs instead of splitting them?
Thanks!