Only 154 lines in JSONL file, but Prodigy reads in 9,265 total documents

I am using this command:

python -m prodigy ner.manual emails en_core_web_sm documents.jsonl --label ACTION,ENGINEER,SUPPLIER,TIME,ORG,LOCATION, --patterns seed_patterns.jsonl

but when Prodigy launches, it loads 9,265 documents. I have tried a few things, from creating new .json files to adding flags such as --total-num-tasks 154, and I am still getting the issue. Can someone help me with this?

Hi @dazaink,

Thanks for your question.

Can you provide more details about your documents.jsonl file?

Can you run this script on it?

import json

def count_text_keys(jsonl_file_path):
    count = 0

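    # Open explicitly as UTF-8 so the result doesn't depend on the platform's default encoding
    with open(jsonl_file_path, 'r', encoding='utf-8') as file: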
        for line in file:
            try:
                data = json.loads(line)
                if 'text' in data:
                    count += 1
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON: {line}")

    return count

# Replace with the actual path to your JSONL file
file_path = 'documents.jsonl'
result = count_text_keys(file_path)

print(f'Total dictionaries with key "text" in {file_path}: {result}')

I'm curious: where are you getting 9,265 from? Can you share where exactly you see this number?

Can you also provide your logs?

Where did you see the flag --total-num-tasks? That's not a built-in flag for ner.manual or any other recipe.

If you're looking to modify your source (input) file, you should filter it up front with a Python script (e.g., remove certain examples) and then pass the filtered file to Prodigy.
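
As a rough sketch of that kind of pre-filtering (not an official Prodigy utility), something like the following could work. The helper names filter_jsonl and has_text, the output file name documents_filtered.jsonl, and the rule "keep only records with a non-empty text field" are all assumptions for illustration, so swap in whatever condition fits your data.

```python
import json
from pathlib import Path


def filter_jsonl(src_path, dst_path, keep):
    """Copy records from src_path to dst_path when keep(record) is true.

    Returns the number of records written.
    """
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)
            if keep(record):
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept


def has_text(record):
    # Hypothetical filter: keep only dict records with a non-empty "text" value
    return isinstance(record, dict) and bool(str(record.get("text", "")).strip())


if __name__ == "__main__":
    # Assumed file names; the output name is made up for this example
    if Path("documents.jsonl").exists():
        n = filter_jsonl("documents.jsonl", "documents_filtered.jsonl", has_text)
        print(f"Kept {n} examples")
```

You would then point the recipe at the filtered file instead of the original one.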

Can you also provide your Prodigy version (run prodigy stats) and any modifications to your prodigy.json file?