End-to-end production system for ranking horizontal-resistance breakout setups on Indian equities (NSE main board). Combines a rule-based primary detector, a calibrated LightGBM meta-model, sector + theme rotation context, and a parallel pre-breakout scanner. Runs daily via launchd.
┌──────────────────────────────────────┐
│ DATA LAYER (data/) │
│ • OHLCV cache (Angel One) │
│ • Benchmark ^NSEI │
│ • Fundamentals + sector mapping │
│ • Sector / theme equal-weight idx │
└─────────────────┬────────────────────┘
│
┌─────────────────────────┴──────────────────────────┐
▼ ▼
┌─────────────────────┐ ┌──────────────────────┐
│ ML PIPELINE │ │ LEGACY SCANNER │
│ scripts/01..07 │ │ legacy_scanner/ │
│ │ │ │
│ Setup detector → │ │ Pre-breakout v3.5: │
│ ~110 features → │ │ hard gates → │
│ LightGBM + │ │ pattern detectors → │
│ isotonic + │ │ HC / WL classifier │
│ stacker + regime │ │ │
│ + R-multiple head │ └──────────────────────┘
└────────┬────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DAILY SCAN (scripts/05_daily_scan.py) │
│ for each symbol: │
│ OHLCV tail-update → setup → features → predict P(win) │
│ filter @ watchlist / trigger thresholds → risk plan │
│ CHAIN → 08_sector_rotation.py │
│ ├─ GICS sector rotation │
│ ├─ Custom theme rotation │
│ └─ MLRotation_<date>.xlsx (combined workbook) │
└─────────────────────────────────────────────────────────────┘
│
▼
Output/ scan_*.csv, triggers_*.csv,
sector_rotation_*.csv, theme_rotation_*.csv,
watchlist_in_leaders_*.csv,
watchlist_in_theme_leaders_*.csv,
MLRotation_<date>_<time>.xlsx
The two systems (ML pipeline + legacy_scanner) are independent. The ML pipeline is the canonical daily output; the legacy scanner is a complementary, rule-only pre-breakout audit that does not depend on a trained model.
| Source | Used for | Auth |
|---|---|---|
| Angel One SmartAPI (primary) | Daily OHLCV, ^NSEI benchmark; ~2 req/sec | .env: ANGEL_API_KEY, ANGEL_CLIENT_CODE, ANGEL_PIN, ANGEL_TOTP_SECRET (TOTP via pyotp) |
| jugaad-data (fallback 1) | NSE main board only | None |
| yfinance (fallback 2) | Broad coverage; fundamentals & market-cap (Angel does NOT expose these) | None |
| NSE archives CSV | Universe seed (NSE main + NSE Emerge SME) | None |
| BSE listing JSON | BSE SME platform + NSE→BSE fallback map | None |
| NSE F&O list CSV | Drop F&O underlyings from pct-down screener | None |
| ^CRSLDX (Yahoo) | NIFTY 500 RS baseline (pct-down screener) | None |
| NSE price-band-hitter API | Upper-circuit deep-dive (upper_band/) |
None |
OHLCV routing is centralized in legacy_scanner/data_provider.py
(download(...) → Angel → jugaad → yfinance). Mirror module
src/yf_ingestion.py is used by the ML pipeline.
ML Breakouts/
├── configs/
│ └── default.yaml # ALL ML knobs: universe, setup gates,
│ # labeling, CV, model params, inference
│ # thresholds, Kelly sizing
│
├── data/ # local cache + ML artefacts
│ ├── ohlcv/<SYM>.parquet # ~2500 daily-bar parquets (Angel One)
│ ├── benchmark_NIFTY.parquet # ^NSEI close
│ ├── fundamentals.parquet # yfinance .info snapshot + ranks
│ ├── sector_indices.parquet # equal-weight GICS sector levels
│ ├── theme_indices.parquet # equal-weight custom themes
│ ├── sectors_seed.json # legacy sector seed (deprecated)
│ ├── training/
│ │ └── setups.parquet # ~24k rows × 119 cols (replay output)
│ └── models/ # trained artefacts (12 files)
│ ├── lgbm.txt # final LightGBM
│ ├── lgbm_regime_bull.txt # regime sub-models
│ ├── lgbm_regime_bear.txt
│ ├── calibrator.pkl # isotonic
│ ├── regime_calibrators.pkl
│ ├── stacker.pkl # LR meta-learner over LGBM/XGB
│ ├── rmult_regressor.txt # R-multiple head (Kelly sizing)
│ ├── features.json # feature column list
│ ├── best_hparams.yaml # latest Optuna winner
│ ├── oof_predictions.parquet # CV OOF predictions for evaluation
│ ├── cv_metrics.csv
│ └── feature_importance.csv
│
├── src/ # ML pipeline modules
│ ├── paths.py # config + canonical paths
│ ├── data_ingestion.py # universe derivation from cache
│ ├── yf_ingestion.py # OHLCV download / tail-update
│ ├── universe.py # universe filters
│ ├── setup_detector.py # rule-based primary (resistance + base)
│ ├── features.py # ~110 feature engineering pipeline
│ ├── labeling.py # triple-barrier + sample uniqueness
│ ├── cv.py # walk-forward + purged k-fold w/ embargo
│ ├── model.py # LightGBM trainer + isotonic + stacker
│ │ # + regime sub-models + R-multiple head
│ ├── fundamentals.py # yfinance .info fetcher + ranks
│ ├── sectors.py # GICS sector index builder
│ ├── sector_rotation.py # full sector rotation pipeline
│ ├── theme_rotation.py # custom theme rotation pipeline
│ └── index_constituents.json # custom theme baskets
│
├── scripts/ # phase-numbered driver scripts
│ ├── 01_build_dataset.py # bootstrap/refresh OHLCV; chains 02→03→04
│ ├── 02_build_training_set.py # replay history → setups.parquet
│ ├── 03_train_model.py # train LGBM + calibrator + heads
│ ├── 04_evaluate.py # OOF metrics, decile, calibration
│ ├── 05_daily_scan.py # PRODUCTION daily scan + chains 08
│ ├── 06_fetch_fundamentals.py # refresh fundamentals + sector idx
│ ├── 07_optuna_search.py # LGBM hyperparam search
│ ├── 08_sector_rotation.py # sector + theme rotation report
│ └── run_daily_scan.sh # launchd entry point
│
├── legacy_scanner/ # rule-only pre-breakout system
│ ├── angel_client.py # SmartAPI wrapper (TOTP, scrip-master)
│ ├── data_provider.py # Angel → jugaad → yfinance fallback
│ ├── breakout_scanner_angel.py # v3.5 HC/WL classifier (live scanner)
│ ├── multi_pct_down.py # pct-down screener (NSE/SME/BSE-SME)
│ ├── breakout_charts/ # rendered Plotly chart HTML
│ ├── _cache_ohlcv/ # legacy scanner OHLCV cache
│ ├── multi_pct_down_report.xlsx # input universe for the scanner
│ └── breakout_watchlist.xlsx # latest scanner output
│
├── upper_band/ # NSE upper-circuit deep-dive pipeline
│ ├── upperband_analyze.py
│ ├── charts/<DATE>/
│ ├── analysis/<DATE>/
│ └── Upper Band.csv
│
├── launchd/
│ └── com.mlbreakouts.dailyscan.plist # macOS scheduler config
│
├── notebooks/ # exploratory work (gitignored)
├── tests/
├── Output/ # daily artefacts (see § 5)
├── logs_*.txt # historical phase / optuna run logs
├── README.md
└── requirements.txt
configs/default.yaml is the single source of truth for the ML
pipeline. The legacy scanner has its own thresholds embedded in
legacy_scanner/breakout_scanner_angel.py (v3.5 HC rule:
pocket_pivot AND lvs >= 0.4 AND base_tight).
Setup detector → features → meta-model → calibrated probability.
| Stage | What it does |
|---|---|
| Primary | src/setup_detector.py — fractal-pivot resistance detection + base-formation gates (config: setup block). Hard gates eliminate ~99% of bars. |
| Features | src/features.py — ~110 columns: price/volume, sector RS, regime, fundamentals (fund_* + cross-sectional ranks), volatility, base geometry. Inputs: bench, sector indices, fundamentals. |
| Labels | src/labeling.py — triple-barrier (+3 ATR / -1 ATR / 10d), plus require_close_above_R confirmation in first 5 bars. Sample-uniqueness weights from label overlap. |
| CV | src/cv.py — Walk-forward (default; expanding window) or PurgedKFold with embargo. Last 12 months are strict holdout never seen in training. |
| Model | src/model.py — LightGBM with monotone constraints + isotonic calibration. Optional stacker (LGBM+XGB-DART → LR meta-learner), regime-conditioned sub-models (split by bench_above_200dma), R-multiple regressor for Kelly sizing. |
| Inference | At each bar: detect setup → make_features → predict → blend regime/stacker → calibrate → threshold (watchlist 0.15, trigger 0.22). |
legacy_scanner/breakout_scanner_angel.py reads
multi_pct_down_report.xlsx (output of multi_pct_down.py), pulls
OHLCV via Angel, and applies hard gates + pattern detectors.
v3.5 HC rule: pocket_pivot AND lvs >= 0.4 AND base_tight
(base_tight = range_pct ≤ 0.15 OR trailing_pct ≤ 0.08).
Pattern detectors: pocket_pivot (the only filter with measurable edge on n=247 backtested signals — 50% wr solo), TTM squeeze, Wyckoff spring, OBV divergence, flat-base flag, gap-fill flag. Cup-and-handle and VCP detectors are known-broken and excluded from HC promotion.
Calibrated against an in-tree backtest (backtest_breakouts.py,
analyze_signal_combos.py) over 71 as-of dates × 18 months. 90%
precision is statistically unreachable on the available signal density;
v3.5 produces ~0–2 HC + 1–5 WL per day.
All daily artefacts live in Output/:
| File | Source | Description |
|---|---|---|
scan_<YYYYMMDD>.csv |
05_daily_scan.py |
Full ranked watchlist (prob >= watchlist_thr) |
triggers_<YYYYMMDD>.csv |
05_daily_scan.py |
High-conviction subset (prob >= trigger_thr) |
sector_rotation_<YYYYMMDD>.csv |
08_sector_rotation.py |
GICS sector leaderboard |
sector_rotation.parquet |
08_sector_rotation.py |
Always-latest sector snapshot |
theme_rotation_<YYYYMMDD>.csv |
08_sector_rotation.py |
Custom-theme leaderboard |
theme_rotation.parquet |
08_sector_rotation.py |
Always-latest theme snapshot |
watchlist_in_leaders_<YYYYMMDD>.csv |
08_sector_rotation.py |
ML setups in top-N sectors |
watchlist_in_theme_leaders_<YYYYMMDD>.csv |
08_sector_rotation.py |
ML setups in top-N themes |
MLRotation_<YYYYMMDD>_<HHMMSS>.xlsx |
08_sector_rotation.py |
Combined 8-sheet workbook |
logs/daily_scan_<TS>.log |
run_daily_scan.sh |
Full launchd log (last 60 retained) |
Legacy scanner outputs land in legacy_scanner/:
breakout_watchlist.xlsx, breakout_charts/<symbol>_breakout.html.
Upper-circuit deep-dive outputs land in
upper_band/analysis/<DATE>/ and upper_band/charts/<DATE>/.
| File | Rewritten by daily run? | Notes |
|---|---|---|
data/ohlcv/<SYM>.parquet (~2500) |
Yes, by default | Skip with --no-refresh (or --no-update for 05_daily_scan). Bulk Angel pull dominates wall time (~10 min). |
data/benchmark_NIFTY.parquet |
Yes, by default | Skipped with --no-refresh. |
data/sector_indices.parquet |
Yes (always — rebuild_indices=True hard-coded) |
|
data/theme_indices.parquet |
Yes (unless --no-themes) |
|
data/fundamentals.parquet |
No | Only 06_fetch_fundamentals.py writes it (quarterly cadence). |
data/training/setups.parquet |
No | Only 02_build_training_set.py (quarterly). |
data/models/*.parquet (oof_predictions.parquet) |
No | Only 03_train_model.py (quarterly). |
| When | Command | Purpose |
|---|---|---|
| Daily (auto, launchd) | bash scripts/run_daily_scan.sh → 05_daily_scan.py → chains 08_sector_rotation.py |
Production scan + sector + theme rotation + combined workbook |
| Daily (manual, parallel) | python legacy_scanner/multi_pct_down.py then python legacy_scanner/breakout_scanner_angel.py |
Rule-only pre-breakout audit |
| Daily (optional) | python upper_band/upperband_analyze.py |
Upper-circuit cohort study |
| Quarterly refresh | 06 then 01 (auto-chains 02→03→04) |
Refresh fundamentals, OHLCV history, training set, retrain, evaluate |
| Yearly tune | 07 → 03 → 04 |
Optuna search → retrain with winners → re-evaluate |
| Ad-hoc evaluation | python scripts/04_evaluate.py |
OOF metrics, decile, calibration after any retrain |
cp launchd/com.mlbreakouts.dailyscan.plist ~/Library/LaunchAgents/
launchctl bootstrap gui/$(id -u) ~/Library/LaunchAgents/com.mlbreakouts.dailyscan.plist
launchctl enable gui/$(id -u)/com.mlbreakouts.dailyscan
launchctl kickstart -k gui/$(id -u)/com.mlbreakouts.dailyscanscripts/run_daily_scan.sh is the entry point: skips weekends
(DOW >= 6), tees output to Output/logs/daily_scan_<TS>.log,
rotates logs (keeps 60). Status:
launchctl print gui/$(id -u)/com.mlbreakouts.dailyscan | grep -E 'state|pid|runs|last exit code'# 1. Setup
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp legacy_scanner/.env.template legacy_scanner/.env # fill Angel credentials
# 2. Bootstrap data + train (first time only)
python scripts/06_fetch_fundamentals.py # fundamentals + sector idx
python scripts/01_build_dataset.py # OHLCV → train set → model → eval
# (auto-chains 01→02→03→04)
# Or run just step 01 without chaining:
python scripts/01_build_dataset.py --only # OHLCV cache only (~10 min)
# 3. Run daily scan
python scripts/05_daily_scan.py # full pipeline
# or skip refresh for a fast re-run:
python scripts/05_daily_scan.py --no-update --rotation-no-refresh
# 4. (parallel) legacy scanner
python legacy_scanner/multi_pct_down.py
python legacy_scanner/breakout_scanner_angel.py| Model | Verdict |
|---|---|
| Random Forest | Outdated since 2017. Loses to GBM on every tabular benchmark. |
| LightGBM | Industry standard for daily equity alpha. Used by WorldQuant, Two Sigma, Numerai, Kaggle JaneStreet/Optiver/JPX winners. |
| XGBoost | Equivalent to LightGBM, slightly slower. (Used as a stacker member.) |
| CatBoost | Strong alternative; disabled here (~30+ min/fold without GPU). |
| TabNet | Loses to LGBM in Shwartz-Ziv & Armon (NeurIPS 2022). |
| TFT / Transformer | Needs ≥50k samples and GPUs. We have ~24k setups. Overkill. |
| LSTM / GRU | Replaced by TCN/Transformer; both lose to LGBM on tabular data. |
LightGBM models P(setup succeeds), not next-day return.
renewAccessToken()SmartAPI bug: occasionalsignatureerror during mid-run token refresh; auto-recovers.- OHLCV cache hygiene:
data/ohlcv/has ~2503 files but the live universe is ~1140 — older listings are never evicted. fundamentals.parquetis not auto-refreshed: sector mappings can drift. Re-runscripts/06_fetch_fundamentals.pyquarterly.require_close_above_Rlabel v2: tighter than v1; positives are rarer (~12% base rate) but more meaningful.scale_pos_weight: 0.9354intentionally <1 to penalize false positives in the imbalanced setting (Optuna chose this).- Legacy
cup_and_handle+vcp_contractions: detectors are known broken (0 / 16 wins; 7.7% wr respectively). Excluded from v3.5 HC.
- López de Prado, Advances in Financial Machine Learning (2018) — Ch. 3 (triple-barrier), Ch. 4 (sample weighting), Ch. 7 (purged CV)
- Microsoft Qlib — https://github.com/microsoft/qlib
- Shwartz-Ziv & Armon, Tabular Data: Deep Learning is Not All You Need (NeurIPS 2022)
- Numerai whitepaper — feature neutralization & meta-modeling
- Bensdorp, Automated Stock Trading Systems
- O'Neil, How to Make Money in Stocks (pocket-pivot definition used in legacy_scanner)
Details of when to run and train ML model, which file to run and cyclical duration:
Details of parquet files and there run cycle:
Details of all the parquet files in "data" folder, along with which python file generates which parquet file: