Model Training & Dataset Exploration

Hello,
My coworker and I have been training a NER model using batch-train and we have a few questions about the definitions of terms such as iteration and batch size, drop-out, and eval-split. Can you elaborate a bit on what these terms mean with regard to Prodigy?
On Stack Overflow, I’ve seen the definition of an iteration as a pass over the batch size and an epoch as a pass over all the training examples. So, if we have 100 examples, and a batch size of 4 with 10 iterations, does that mean that we will only train on 40 examples?
I’ve also noticed an option called factor that denotes the portion of the examples to train on. Is there also a way to set the number of epochs?
In our dataset, we have 2900 examples. When we ran batch-train with an eval-split of 0.2, it reported 315 examples used for evaluation and 1929 used for training, which doesn't quite make sense to us – 20% of 2900 should be 580. Is there a reason these numbers are reported?

Lastly, for any given dataset, is there a way to get the number of entities and the number of examples it contains?

Thanks for clearing up my misunderstandings and thank you for your time in reading this!

In the Prodigy training recipes, we use epoch and iteration synonymously – sorry if this was a bit confusing. So --n-iter is the number of iterations over the training data, i.e. the number of times the training loop runs over the whole data. Within the training loop, the data is divided into batches, defined by the batch size setting.

Here's a quick terminology overview:

  • iteration: number of times the training loop runs over the full training data (used synonymously with epoch here)
  • batch size: number of examples per batch – within each iteration, the training data is divided into batches of this size
  • dropout: the dropout rate, i.e. the proportion of individual features randomly dropped during each update, which makes it harder for the model to just memorize the training data
  • eval split: the percentage of examples to hold back for evaluation, if no dedicated evaluation set is provided (mostly relevant for quick experiments – it'll never be as reliable as a separate evaluation set)
  • factor: percentage of examples to train on – e.g. 0.5 would only train on half of the examples. (this is less relevant for the regular batch train recipes, but it's used in the train-curve recipes to show you results on different portions of the data, so you can see whether the accuracy improves with more examples)
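
To make the relationship between these settings a bit more concrete, here's a rough sketch of a generic training loop – just an illustration of the terminology, not Prodigy's actual implementation. The update_model and evaluate_model functions are hypothetical placeholders:

import random

def update_model(batch, drop=0.0):
    pass  # placeholder for the actual model update

def evaluate_model(examples):
    pass  # placeholder for the actual evaluation

def train(examples, n_iter=10, batch_size=4, eval_split=0.2, factor=1.0, dropout=0.2):
    random.shuffle(examples)
    n_eval = int(len(examples) * eval_split)                  # eval split: held back for evaluation
    evals, train_set = examples[:n_eval], examples[n_eval:]
    train_set = train_set[:int(len(train_set) * factor)]      # factor: portion of examples to train on
    for i in range(n_iter):                                   # each iteration = one pass over the training set
        random.shuffle(train_set)
        for j in range(0, len(train_set), batch_size):        # the data is divided into batches
            batch = train_set[j:j + batch_size]
            update_model(batch, drop=dropout)                 # dropout is applied during each update
        evaluate_model(evals)                                 # evaluate after each iteration

So with 100 examples, a batch size of 4 and 10 iterations, each iteration still sees all 100 training examples – the batch size only means they're processed 4 at a time, i.e. in 25 updates per iteration.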

It's difficult to give a definitive answer without knowing your data, but one possible explanation is that those are the true unique examples after all spans have been merged. Before training, Prodigy will find all examples on the same input and merge them into one single example with multiple spans (instead of several examples with one span each). For instance, if you're using ner.teach, you may end up with several answers about the same input text, each about a different pre-highlighted entity.
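
As a very rough illustration of the idea (this isn't Prodigy's actual internals – Prodigy groups on input hashes rather than the raw text), the merging could look like this:

from collections import defaultdict

def merge_spans(examples):
    # Group annotations on the same input text into one example with all spans
    merged = defaultdict(list)
    for eg in examples:
        merged[eg["text"]].extend(eg.get("spans", []))
    return [{"text": text, "spans": spans} for text, spans in merged.items()]

# Two ner.teach answers about the same sentence become one training example
examples = [
    {"text": "Apple is based in Cupertino.", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"text": "Apple is based in Cupertino.", "spans": [{"start": 18, "end": 27, "label": "GPE"}]},
]
print(merge_spans(examples))  # one example with both spans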

I've actually been thinking about adding another log statement that reports the numbers before and after merging. This would make this a bit less confusing, and it might also be a useful number to know.

Prodigy will export the data it used for training and evaluation as JSONL files in the model directory btw, so you'll always have a reference to what exactly was used.
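
If you want to double-check the exact split, you can load those files back in with a few lines of Python – the file names below are just what I'd expect to find, so check your output directory for the actual names:

import json
from pathlib import Path

model_dir = Path("/path/to/model")  # the output directory passed to batch-train
for name in ("training.jsonl", "evaluation.jsonl"):  # adjust to the actual file names
    path = model_dir / name
    if path.exists():
        examples = [json.loads(line) for line in path.open(encoding="utf8")]
        print(name, len(examples))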

We tried to come up with a data format that's easy to analyse and a user-facing API to interact with the database and datasets, so you can write your own scripts to analyse whatever you need.

Here's an example – let's say you want to get the total number of examples, and counts for the individual entity labels – for both the rejected and accepted examples. This would let you see which label was annotated the most, which one was accepted or rejected the most and so on:

from prodigy.components.db import connect
from collections import Counter

db = connect()
examples = db.get_dataset("your_dataset")
print("Number of examples": len(examples))

accepted_counter = Counter()
rejected_counter = Counter()

for eg in examples:
    # Each example is a dict with properties like "text", "spans" etc.
    for span in eg.get("spans", []):
        label = span["label"]
        if eg["answer"] == "accept":
            accepted_counter[label] += 1
        elif eg["answer"] == "reject":
            rejected_counter[label] += 1

print(accepted_counter)
print(rejected_counter)

You can find more details on the database API in your PRODIGY_README.html. If you only need the total count and accept/reject stats, you can also run the prodigy stats command with the name of the dataset. For example:

prodigy stats your_dataset