Loading Multiple Files for ner.teach

Hi,

Do I have to create a new recipe to load multiple files during ner.teach? I have a corpus with varied files and I want to train the model on all of them. How do I do this without invoking the ner.teach command multiple times?

Thanks for the software and the help,
Sandeep

Prodigy’s built-in file loaders currently only support single files and not directories. But it’s pretty easy to load in your own corpora using custom ETL logic (even without a custom recipe).

The source argument you set on the command line is usually a path to a file – but if it's not set, it defaults to sys.stdin. This means you can pipe data through it, for example:

python load_my_corpus.py | prodigy ner.teach my_dataset en_core_web_sm

The load_my_corpus.py script can then load and pre-process the corpus, and print the JSON-dumped annotation tasks to stdout. Here’s some pseudocode to illustrate the idea:

import json

for data_file in corpus:  # "corpus" is whatever iterable of files you have
    data = preprocess_your_data_somehow(data_file)
    for line in data:
        # each annotation task is a dict with at least a "text" key
        task = {'text': line}
        print(json.dumps(task))

Depending on the complexity of your corpus, you can add rules to handle files differently depending on their type, or read out different fields. You could also import and re-use Prodigy’s built-in loaders in prodigy.components.loaders – see the “Loaders” section in the PRODIGY_README.html for more details and examples.
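For example, if some of your corpus files are already in JSONL format, you could re-use the built-in JSONL loader for those instead of parsing them yourself. Here's a minimal sketch – the directory name is just a placeholder, and the loader yields one task dict per line:

import json
from pathlib import Path
from prodigy.components.loaders import JSONL

corpus_dir = Path('my_corpus')  # placeholder path to your corpus directory

for data_file in corpus_dir.glob('*.jsonl'):
    for task in JSONL(str(data_file)):  # yields dicts in Prodigy's task format
        print(json.dumps(task))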

Thanks! I tried sending in whole files here, and it worked. Pasting the code below in case someone else needs it.

import glob2
import json
import os

corpusfolderpath = 'xyz'  # path to your corpus folder

for filename in glob2.glob(corpusfolderpath + '/*'):
    if os.path.isfile(filename):
        # open in text mode – bytes from 'rb' can't be serialized by json.dumps
        with open(filename, 'r', encoding='utf-8') as f:
            text = f.read()
            task = {'text': text}
            print(json.dumps(task))

Do you know what happens if a sentence is already in the annotations dataset? Is it ignored, or do we get asked again?

By default, Prodigy doesn’t make any assumptions about this and will let you re-annotate the same task. But you can tell it to exclude annotations of existing datasets by setting --exclude dataset_name (or multiple, comma-separated names). This is also very useful when creating evaluation sets.
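For example, to avoid re-annotating anything already in your dataset (the dataset and file names here are placeholders):

prodigy ner.teach my_dataset en_core_web_sm my_data.jsonl --exclude my_dataset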

The tasks are compared based on their hashes. When a new annotation task comes in, Prodigy assigns an "_input_hash" to the task, based on its content – by default, properties like "text". When you run ner.teach, Prodigy will add "spans" to each task containing the entity you're annotating. The input hash and the annotation features are then hashed again to create a "_task_hash", which is used to determine whether two annotation tasks are the same.

This means that Prodigy will exclude tasks asking the same questions – but still allow different questions about the same text that you haven’t answered before. You can find more details on the hashing in the PRODIGY_README.html, for example in the API docs of the set_hashes helper function.
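To see the distinction in action, you can hash two tasks yourself with the set_hashes helper – a minimal sketch, assuming the default hashing properties ("text" for the input hash, "spans" etc. for the task hash):

from prodigy import set_hashes

# same text, with and without a pre-annotated span
task_a = set_hashes({'text': 'Apple is a company'})
task_b = set_hashes({'text': 'Apple is a company',
                     'spans': [{'start': 0, 'end': 5, 'label': 'ORG'}]})

# identical text means identical input hashes ...
assert task_a['_input_hash'] == task_b['_input_hash']
# ... but different spans mean different task hashes
assert task_a['_task_hash'] != task_b['_task_hash']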
