Does Train Curve Select Examples in Annotation Order?

When Prodigy selects subsets of annotated data to draw a training curve with a recipe like ner.train-curve, are the samples selected randomly or in the order in which they were annotated?

A learning curve drawn from samples taken in annotation order not only shows whether more annotation will help; it also shows how effective Prodigy's particular active learning method is. If active learning is working well, the in-order learning curve should rise faster than a curve drawn from randomly selected samples – and ideally faster still than random selection from a wholly annotated corpus.

Not at the moment, no. Internally, ner.train-curve delegates to ner.batch-train and runs it n_samples times (4 by default), each time with a different factor (0.25, 0.5, 0.75 and 1 by default). This also means that the examples are shuffled on each run – and if you're not providing an evaluation set, the evaluation examples are held back from that shuffled sample as well.
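As a rough illustration, the behavior described above might look something like this in simplified form (a sketch only, not Prodigy's actual implementation – the function name and parameters are made up for illustration):

```python
import random

def train_curve_runs(examples, n_samples=4, seed=None):
    # Simplified sketch of the current train-curve behavior: each run
    # re-shuffles the full dataset before slicing, so the subsets are
    # random rather than in annotation order.
    rng = random.Random(seed)
    factors = [(i + 1) / n_samples for i in range(n_samples)]  # 0.25, 0.5, 0.75, 1.0
    runs = []
    for factor in factors:
        shuffled = list(examples)
        rng.shuffle(shuffled)  # shuffled on *each* run
        runs.append(shuffled[:int(len(shuffled) * factor)])
    return runs
```

Because the shuffle happens before the slice, each subset is a random draw from the whole dataset, which is exactly what makes the curve insensitive to annotation order.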

I really like your idea, though, and it's definitely worth experimenting with! :+1: Ideally, you'd want a dedicated evaluation set for this approach. Splitting the examples randomly is fine for quick experiments and a good approximation – but if you're interested in the development over time, you usually want an evaluation set you can reuse.
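One simple way to get a reusable evaluation set is to shuffle once with a fixed seed and hold out a slice – that way every run scores against the same examples. This is just a sketch; `make_fixed_split` and its parameters are illustrative, not a Prodigy API:

```python
import random

def make_fixed_split(examples, eval_fraction=0.2, seed=42):
    # Shuffling once with a fixed seed yields the same held-out set on
    # every run, so accuracy numbers stay comparable across experiments.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)
```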

Next, you could try removing the two instances of random.shuffle(examples) that run before ner.batch-train enters the training loop, and instead shuffle only after the smaller sample for each factor has been sliced off:

```python
# Take the first len(examples) * factor examples in annotation order
examples = examples[:int(len(examples) * factor)]
random.shuffle(examples)  # shuffle only within the selected sample
```

If your set contains 4000 examples, the train curve should then use examples[:1000], examples[:2000], examples[:3000] and examples[:4000], in the exact order of the original dataset.
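Putting the modification together, the in-order sampling could be sketched like this (again with illustrative names, not actual Prodigy code):

```python
import random

def in_order_samples(examples, n_samples=4, seed=0):
    # Slice the prefix first (annotation order), then shuffle only
    # within the slice so training still sees the examples in a
    # random order.
    rng = random.Random(seed)
    factors = [(i + 1) / n_samples for i in range(n_samples)]
    samples = []
    for factor in factors:
        sample = list(examples[:int(len(examples) * factor)])  # first N in order
        rng.shuffle(sample)  # shuffle within the prefix only
        samples.append(sample)
    return samples
```

With 4000 examples, each sample is a shuffled copy of the first 1000, 2000, 3000 and 4000 examples respectively, so the curve reflects annotation order while training itself still gets shuffled batches.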

If you end up trying this out, definitely keep us updated! If it works well, this would be a nice option to add to the train-curve recipes.