Does Prodigy allow loading all files from a filepath?


I have various txt files in a folder. Is it possible to load all the txt files from the folder into my Prodigy annotation dataset for entity annotation?

Out-of-the-box, Prodigy currently supports loading in data from single files of various types – for text, that’s .jsonl, .json, .txt and .csv. You can specify the loader via the --loader argument on the command line. If no loader is set, Prodigy will use the file extension to pick the respective loader.

prodigy ner.teach your_dataset en_core_web_sm /path/to/data.txt

So if you have multiple .txt files and want to use them all, the easiest way would be to combine them into one file. Alternatively, you can also always write your own loader script.
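For the first option, combining the files, a minimal sketch could look like the following. The function name `combine_txt_files` is just an illustration and not part of Prodigy, and it assumes each line of each .txt file should become one line of the combined file, which matches how the .txt loader is used above.

```python
from pathlib import Path


def combine_txt_files(source_dir, output_file):
    """Write every non-empty line from the .txt files in source_dir to output_file.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    source_dir = Path(source_dir)
    output_file = Path(output_file)
    count = 0
    with output_file.open("w", encoding="utf8") as out:
        for file_path in sorted(source_dir.glob("*.txt")):
            if file_path.resolve() == output_file.resolve():
                continue  # don't read the combined file back in
            with file_path.open("r", encoding="utf8") as f:
                for line in f:
                    line = line.strip()
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        count += 1
    return count
```

Calling `combine_txt_files('/path/to/directory', '/path/to/data.txt')` would then produce a single file you can pass to the recipe as the source.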

If no source argument (file path etc.) is set on the command line, the source will default to sys.stdin. This lets you pipe data forward from a different process, like a custom script. For example, with your_loader_script.py standing in for whatever you call your script:

python your_loader_script.py | prodigy ner.teach your_dataset en_core_web_sm

All your custom loader script needs to do is load the data somehow, create annotation tasks in Prodigy’s format (a dictionary with a "text" key) and print the dumped JSON. For example:

from pathlib import Path
import json

data_path = Path('/path/to/directory')
for file_path in sorted(data_path.glob('*.txt')):  # only the .txt files, in a stable order
    with file_path.open('r', encoding='utf8') as f:  # the file is closed when the block ends
        for line in f:
            text = line.strip()  # drop the trailing newline
            if not text:
                continue  # skip blank lines
            task = {'text': text}  # create one task for each line of text
            print(json.dumps(task))  # dump and print the JSON

This approach works for any file format and data type – for example, you could also load in data from a different database or via an API. If you can load your data in Python, you can use it with Prodigy :blush:
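To illustrate the database case, here is a rough sketch using Python's built-in sqlite3 module. The table name `documents`, the column name `body` and the function names are made up for the example, so adjust them to your own schema. The only Prodigy-specific part is the task format described above: a dictionary with a "text" key, printed as one JSON object per line.

```python
import json
import sqlite3


def iter_tasks_from_db(conn, query="SELECT body FROM documents"):
    """Yield one task dict per non-empty text returned by the query.

    The default query uses a hypothetical table and column name.
    """
    for (text,) in conn.execute(query):
        if text and text.strip():
            yield {"text": text.strip()}


def print_tasks(conn):
    """Print each task as one line of JSON so it can be piped to Prodigy."""
    for task in iter_tasks_from_db(conn):
        print(json.dumps(task))
```

In a loader script, you would open a connection with something like `sqlite3.connect('/path/to/texts.db')` (a placeholder path), call `print_tasks(conn)` and pipe the script's output to the recipe in the same way as before.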

There’s currently also an open feature request for allowing paths to directories instead. If that’s something you’re interested in having Prodigy support out-of-the-box, you can vote for it on that thread.


Thanks a lot for the help.
