Hello,
I just purchased the Prodigy tool and am trying to set it up in a virtual machine that is connected to a secure database (we're dealing with sensitive data that we don't want to persist on the machine).
I read, in the custom loader docs, that you can write a loader to load data from a SQL query and pipe it to prodigy.
Here is the Python I used to do that
import boto3
import psycopg2
import argparse
import json
import sys
#Parse command line args
parser = argparse.ArgumentParser("Parse args to load data from remote RDS")
parser.add_argument("host", type=str)
parser.add_argument("db", type=str)
parser.add_argument("table", type=str)
args = parser.parse_args()
#Create db connection
pw = "some_password"
conn = psycopg2.connect(database=args.db, user="user", password=pw, host=args.host, port="5432")
cur = conn.cursor()
#Get all rows
q = '''SELECT * FROM ''' + str(args.table)
cur.execute(q)
rows = cur.fetchall()
#Output to stdout for Prodigy to read
for row in rows:
task = {"text": row[1]}
#I also tried this with print
sys.stdout.write(json.dumps(task))
Here is the command I run
python3 load_prodigy_data.py "host" "db_name" "table_name" | prodigy ner.manual new_db en_core_web_log
I tested out the data loading script without piping to prodigy and it worked fine.
Let me know what I'm missing
Alex