Is it possible to make Prodigy export a Tokenized JSONL file by inputting a JSON file with no annotations done on the dataset?

I am trying to get Prodigy to take a JSON file and export it as a tokenized JSONL file, without annotating the data. I've attached an example of the output I want.
[Screenshot 2022-10-08 000417]

As an example, I've started with this examples.jsonl file:

{"text": "my name is Sok"}
{"text": "my name is Noa"}

This is fed into an NER task.

python -m prodigy ner.manual issue-6018 blank:en examples.jsonl --label name

This opens the ner.manual annotation interface.

After annotating two examples, I can call db-out.

python -m prodigy db-out issue-6018

This yields:

{"text":"my name is Sok","_input_hash":1096766140,"_task_hash":-1686437232,"_is_binary":false,"tokens":[{"text":"my","start":0,"end":2,"id":0,"ws":true},{"text":"name","start":3,"end":7,"id":1,"ws":true},{"text":"is","start":8,"end":10,"id":2,"ws":true},{"text":"Sok","start":11,"end":14,"id":3,"ws":false}],"_view_id":"ner_manual","spans":[{"start":11,"end":14,"token_start":3,"token_end":3,"label":"name"}],"answer":"accept","_timestamp":1665405458}
{"text":"my name is Noa","_input_hash":253661630,"_task_hash":-971655848,"_is_binary":false,"tokens":[{"text":"my","start":0,"end":2,"id":0,"ws":true},{"text":"name","start":3,"end":7,"id":1,"ws":true},{"text":"is","start":8,"end":10,"id":2,"ws":true},{"text":"Noa","start":11,"end":14,"id":3,"ws":false}],"_view_id":"ner_manual","spans":[{"start":11,"end":14,"token_start":3,"token_end":3,"label":"name"}],"answer":"accept","_timestamp":1665405546}

Note that, as a side effect, the ner.manual recipe adds the tokens to each example. Not every recipe does this.
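To make the token format concrete, here is a minimal sketch that rebuilds the same "tokens" structure with a plain whitespace split. This is only an illustration: the recipe above tokenizes with the spaCy pipeline you pass in (blank:en here), which treats punctuation and other cases differently, and the helper name whitespace_tokens is made up for this example.

```python
def whitespace_tokens(text):
    """Build Prodigy-style token dicts with a naive whitespace split.

    Assumption: this stands in for the spaCy tokenizer purely to show the
    shape of each token (text, start, end, id, ws). It is not what Prodigy runs.
    """
    tokens = []
    pos = 0
    for i, word in enumerate(text.split()):
        start = text.index(word, pos)  # character offset where the token starts
        end = start + len(word)        # character offset just past the token
        # "ws" records whether the token is followed by whitespace
        ws = end < len(text) and text[end].isspace()
        tokens.append({"text": word, "start": start, "end": end, "id": i, "ws": ws})
        pos = end
    return tokens


print(whitespace_tokens("my name is Sok"))
```

For simple sentences like the two above, this gives the same offsets as the db-out output, but it will diverge as soon as punctuation is involved.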

If I understand you correctly, you're only interested in the tokens? In that case you could use jq from the command line.

python -m prodigy db-out issue-6018 | jq -c ".tokens"

This yields the tokens as an array on each line.

[{"text":"my","start":0,"end":2,"id":0,"ws":true},{"text":"name","start":3,"end":7,"id":1,"ws":true},{"text":"is","start":8,"end":10,"id":2,"ws":true},{"text":"Sok","start":11,"end":14,"id":3,"ws":false}]
[{"text":"my","start":0,"end":2,"id":0,"ws":true},{"text":"name","start":3,"end":7,"id":1,"ws":true},{"text":"is","start":8,"end":10,"id":2,"ws":true},{"text":"Noa","start":11,"end":14,"id":3,"ws":false}]
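If jq isn't available, roughly the same filter can be sketched in plain Python with the standard json module. The function name extract_tokens is hypothetical, and the sample line is a shortened version of the db-out output above.

```python
import json


def extract_tokens(lines):
    """Yield the "tokens" value of each JSONL line, skipping blank lines.

    Like jq's ".tokens", this yields None for a record without that key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line).get("tokens")


# Shortened sample record; real input would be the lines printed by db-out.
sample = [
    '{"text":"my name is Sok","tokens":[{"text":"my","start":0,"end":2,"id":0,"ws":true}]}'
]
for tokens in extract_tokens(sample):
    # Compact separators mimic the one-array-per-line output of `jq -c`.
    print(json.dumps(tokens, separators=(",", ":")))
```

To run it on real data, you could pipe the db-out output into a script that passes sys.stdin to extract_tokens instead of the sample list.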

If this is all you need, then that's grand. If you want more flexibility, though, I'd recommend writing a Python script that fetches the data you need from the database. In particular, you'll want the Database.get_dataset_examples method.

Example

Here is the Python code that should help you get started. Note that you'll need to replace issue-6018 with your dataset name.

from prodigy.components.db import connect

# Connect to DB 
db = connect()

# I'm using `db.get_dataset` which will be removed in the future 
# in favor of the more explicit Database.get_dataset_examples.
examples = db.get_dataset("issue-6018")
print("\nThese are all the examples:\n")
print(examples)

print("\nThese are just the tokens: \n")
print([{'tokens': e['tokens']} for e in examples])
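Since the goal is a tokenized JSONL file rather than printed output, the last step could be sketched like this. The helper write_tokens_jsonl and the output filename are made up for illustration; it only assumes that examples is a list of dicts like the ones returned above.

```python
import json


def write_tokens_jsonl(examples, path):
    """Write one {"text": ..., "tokens": [...]} object per line.

    Examples without a "tokens" key are skipped. Returns the number of
    lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for eg in examples:
            if "tokens" not in eg:
                continue
            record = {"text": eg.get("text"), "tokens": eg["tokens"]}
            f.write(json.dumps(record) + "\n")
            written += 1
    return written


# Stand-in data; in the script above you would pass `examples` from the database.
demo = [
    {"text": "hi", "tokens": [{"text": "hi", "start": 0, "end": 2, "id": 0, "ws": False}]}
]
write_tokens_jsonl(demo, "tokens.jsonl")
```

Keeping the text next to the tokens makes the file usable as Prodigy-style input later; drop that key if you really only want the tokens.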
