StateTrace
Visual Quant & Low-Latency Systems Lab

Purged k-Fold CV

research process · L2 · idiom · stub
Replaces the belief that scikit-learn's `KFold` is safe for time-series.

Vanilla k-fold treats samples as independent (and scatters them across time when `shuffle=True`; scikit-learn's default is `shuffle=False`). For return data this leaks information between folds, because labels with overlapping observation periods are correlated. Purged k-fold (López de Prado, AFML ch. 7) removes training samples whose labels overlap in time with test labels, then adds an embargo period after each test fold to block leakage from serial correlation. This is the fix that makes CV credible for financial backtesting.
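
The purge-then-embargo rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation from AFML: the function name `purged_kfold_splits`, the integer time axis, and the inputs `t0`/`t1` (label start and end time per sample) are assumptions made for the example.

```python
import numpy as np

def purged_kfold_splits(t0, t1, n_splits=5, embargo=0):
    """Yield (train_idx, test_idx) pairs with purging and an embargo.

    Assumed inputs (hypothetical names): t0[i] and t1[i] are the start
    and end time of sample i's label, with samples sorted by t0 and
    t1 >= t0. `embargo` is measured in the same time units.
    """
    t0 = np.asarray(t0)
    t1 = np.asarray(t1)
    indices = np.arange(len(t0))
    # Contiguous, unshuffled test folds.
    for test_idx in np.array_split(indices, n_splits):
        test_start = t0[test_idx].min()
        test_end = t1[test_idx].max()
        # Purge: drop any sample whose label interval intersects the test span.
        # Embargo: widen the span's right edge so samples starting just
        # after the test fold are dropped as well.
        excluded = (t0 <= test_end + embargo) & (t1 >= test_start)
        yield indices[~excluded], test_idx

# Usage: 20 bars, each label looks 2 bars ahead, 1-bar embargo.
t0 = np.arange(20)
t1 = t0 + 2
for train_idx, test_idx in purged_kfold_splits(t0, t1, n_splits=4, embargo=1):
    print(test_idx.tolist(), "->", train_idx.tolist())
```

In this toy setup the second fold tests on samples 5 to 9; samples 3 and 4 are purged because their labels reach into the test span, and samples 10 to 12 are dropped because they start inside the test labels' horizon plus the embargo.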

Unlocks
Bridges
  • combinatorially-symmetric-cv · model to implementation
    CPCV (López de Prado, AFML ch. 12) generalises purged k-fold to many train/test splits, allowing direct measurement of the Probability of Backtest Overfitting (PBO): the fraction of splits in which the configuration ranked best in-sample falls below the median out-of-sample. Same purging mechanism, richer statistic.
  • embargo-period-sizing · shared measurement
    Embargo length is the serial-correlation horizon of the label generator. Sizing errors cut both ways: too small and leakage survives; too large and training samples are wasted. The right size is measured (for example, with a Ljung-Box autocorrelation test on label residuals), not picked.
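
One way to measure the embargo rather than pick it is sketched below, using only NumPy. The Ljung-Box statistic follows the standard formula Q(h) = n(n+2) Σ ρ_k² / (n−k); turning it into an embargo length by taking the last lag whose autocorrelation leaves the approximate 95% white-noise band is a heuristic assumed for this example, not a rule stated in the source. All function names are hypothetical.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

def ljung_box_q(x, max_lag):
    """Cumulative Ljung-Box Q statistic for lags 1..max_lag.

    Under the no-autocorrelation null, Q(h) is approximately
    chi-square with h degrees of freedom (p-values not computed here).
    """
    n = len(x)
    rho = sample_acf(x, max_lag)
    lags = np.arange(1, max_lag + 1)
    return n * (n + 2) * np.cumsum(rho ** 2 / (n - lags))

def embargo_from_acf(rho, n, z=1.96):
    """Heuristic (an assumption of this sketch): the embargo is the
    last lag whose |autocorrelation| exceeds the z / sqrt(n) band."""
    band = z / np.sqrt(n)
    significant = np.flatnonzero(np.abs(rho) > band)
    return int(significant[-1] + 1) if significant.size else 0

# Usage on a hypothetical array of label residuals:
# rho = sample_acf(residuals, max_lag=20)
# embargo = embargo_from_acf(rho, n=len(residuals))
```

A large Q at small lags signals that an embargo is needed at all; the band heuristic then gives a first guess at its length, which can be checked by re-running the test on residuals from the purged and embargoed splits.
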
Status

This concept is a node in the curriculum DAG. The full lab — page blocks, done state, references — has not been authored yet. The relations above describe where it sits in the graph.

Author at: content/concepts/purged-k-fold-cv/card.ts