We’d like to stream examples from a postgresql db for text classification. Are there any code examples for custom streaming functions?
Many thanks!
I’m not a Postgres expert, so I’m not sure I can suggest a super specific code example – but here’s a more general example of a custom loader script (also see this docs section for more details):
# load_data.py
from pathlib import Path
import json

data_path = Path('/path/to/directory')
for file_path in data_path.iterdir():  # iterate over directory
    with file_path.open('r', encoding='utf8') as lines:  # open file (closed automatically)
        for line in lines:
            task = {'text': line}  # create one task for each line of text
            print(json.dumps(task))  # dump and print the JSON
If no source argument (file path etc.) is set on the command line, Prodigy will default to sys.stdin. This lets you pipe data forward from a different process, like a custom script:
python load_data.py | prodigy ner.teach your_dataset en_core_web_sm
I’m pretty sure you’ll find a Python library or open-source project that provides helpers for efficiently loading and streaming data from your database – or maybe you already have that code in place. Prodigy’s streams are just regular Python generators, so you won’t have to fetch all your data at once: you can make batch requests and yield the texts out as they come in.
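To illustrate the pattern, here’s a minimal sketch of such a generator that fetches rows in batches and yields one task dict per row. It uses sqlite3 purely as a self-contained stand-in for PostgreSQL (with Postgres you’d swap in a driver like psycopg2), and the `examples` table and `text` column are assumed names for this demo:

```python
import json
import sqlite3

def stream_tasks(conn, batch_size=100):
    """Yield one task dict per row, fetching rows in batches
    so the full result set is never held in memory at once."""
    cursor = conn.cursor()
    cursor.execute("SELECT text FROM examples")  # assumed table/column names
    while True:
        rows = cursor.fetchmany(batch_size)  # batch request to the DB
        if not rows:
            break
        for (text,) in rows:
            yield {'text': text}  # one task per row, Prodigy-style

# Tiny in-memory stand-in database for demonstration
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE examples (text TEXT)")
conn.executemany("INSERT INTO examples VALUES (?)",
                 [("first example",), ("second example",)])

for task in stream_tasks(conn, batch_size=1):
    print(json.dumps(task))  # one JSON task per line, ready to pipe
```

Printing one JSON object per line means the script can be piped into Prodigy exactly like the loader above.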
That really helped! After a crash course in peewee, I created a separate script that extracted my examples, formatted each one as JSON per the example above, and piped the result into Prodigy. Worked like a charm.
Many thanks!