Is it possible to make Prodigy export a tokenized JSONL file from an input JSONL file, with no annotations done on the dataset?

I am trying to get Prodigy to export a tokenized JSONL file from an input file, without any annotations done on the data. I've attached an example of how I want the data to be output.
(Screenshot 2022-10-08 000417: attached screenshot showing the desired output format.)

As an example, I've started with this examples.jsonl file:

{"text": "my name is Sok"}
{"text": "my name is Noa"}

This is fed into an NER task.

python -m prodigy ner.manual issue-6018 blank:en examples.jsonl --label name

This launches the annotation interface.

After annotating two examples, I can call db-out.

python -m prodigy db-out issue-6018

This yields:

{"text":"my name is Sok","_input_hash":1096766140,"_task_hash":-1686437232,"_is_binary":false,"tokens":[{"text":"my","start":0,"end":2,"id":0,"ws":true},{"text":"name","start":3,"end":7,"id":1,"ws":true},{"text":"is","start":8,"end":10,"id":2,"ws":true},{"text":"Sok","start":11,"end":14,"id":3,"ws":false}],"_view_id":"ner_manual","spans":[{"start":11,"end":14,"token_start":3,"token_end":3,"label":"name"}],"answer":"accept","_timestamp":1665405458}
{"text":"my name is Noa","_input_hash":253661630,"_task_hash":-971655848,"_is_binary":false,"tokens":[{"text":"my","start":0,"end":2,"id":0,"ws":true},{"text":"name","start":3,"end":7,"id":1,"ws":true},{"text":"is","start":8,"end":10,"id":2,"ws":true},{"text":"Noa","start":11,"end":14,"id":3,"ws":false}],"_view_id":"ner_manual","spans":[{"start":11,"end":14,"token_start":3,"token_end":3,"label":"name"}],"answer":"accept","_timestamp":1665405546}

Note that the ner.manual recipe has the side effect of adding a "tokens" key to each saved example. Not every recipe does this.

If I understand you correctly, you're only interested in the tokens? In that case you could use jq from the command line.

python -m prodigy db-out issue-6018 | jq -c ".tokens"

This yields the tokens as an array on each line.
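If it helps, the same filter can be sketched in plain Python. The sample line below is trimmed from the db-out output above (only the text and tokens keys are kept, dropping the hashes and timestamps):

```python
import json

# One line of the db-out output above, trimmed to the relevant keys
line = ('{"text": "my name is Sok", "tokens": ['
        '{"text": "my", "start": 0, "end": 2, "id": 0, "ws": true}, '
        '{"text": "name", "start": 3, "end": 7, "id": 1, "ws": true}, '
        '{"text": "is", "start": 8, "end": 10, "id": 2, "ws": true}, '
        '{"text": "Sok", "start": 11, "end": 14, "id": 3, "ws": false}]}')

# Equivalent of `jq -c ".tokens"`: keep only the tokens array
tokens = json.loads(line)["tokens"]
print(json.dumps(tokens))
```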


If this is what you need, then that's grand. However, if you want to be more flexible, I might instead recommend writing a Python script that fetches the data you need. In particular, you'll want the get_dataset_examples method.


Here is the Python code that should help you get started. Note that you'll need to replace issue-6018 with your dataset name.

from prodigy.components.db import connect

# Connect to DB 
db = connect()

# I'm using `db.get_dataset`, which will be removed in the future
# in favor of the more explicit Database.get_dataset_examples.
examples = db.get_dataset("issue-6018")

print("\nThese are all the examples:\n")
print(examples)

print("\nThese are just the tokens:\n")
print([{"tokens": e["tokens"]} for e in examples])
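From there, writing a tokens-only JSONL file to disk is a short step. Here's a minimal sketch using only the standard library; the hard-coded examples list and the tokens.jsonl file name are just placeholders for the data you'd fetch from the database:

```python
import json

# Placeholder for the examples fetched from the Prodigy database
examples = [
    {"text": "my name is Sok", "tokens": [
        {"text": "my", "start": 0, "end": 2, "id": 0, "ws": True},
        {"text": "Sok", "start": 11, "end": 14, "id": 3, "ws": False},
    ]},
]

# One JSON object per line, keeping only the tokens
lines = [json.dumps({"tokens": e["tokens"]}) for e in examples]
with open("tokens.jsonl", "w", encoding="utf8") as f:
    f.write("\n".join(lines) + "\n")
```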
