A reflective, introspective account of the journey from raw price data to a deployed cross-sectional quantitative model — told not as a dialogue, but as a record of thought, friction, and evolution.
There is a particular kind of confusion that arrives not from ignorance but from too many simultaneous possibilities. The work begins with seven tickers — AAPL, MSFT, GOOGL, AMZN, NVDA, META, TSLA — downloaded from Yahoo Finance across eleven years of daily closes. The data is there. The library is loaded. The objective is clear in principle: teach a machine to identify, each morning, which of these stocks is most likely to outperform by the end of the day.
But the moment the code editor opens, the intuition immediately surfaces a harder question: what does it even mean to teach a machine to see patterns in noise? There is a gap between understanding the concept and knowing where to place the first line of code. The starting point, then, is not feature engineering — it is the more fundamental question of data architecture.
The earliest impulse is the simplest one: download data, select the Close column, and begin computing returns. data = data['Close'] feels clean, efficient, surgical. It removes the visual clutter of multi-level columns. But this first instinct carries a hidden cost that does not announce itself immediately.
The moment Volume is sliced away, a category of features becomes unreachable. You cannot compute volume_change or volume_avg_ratio from a DataFrame that contains only closing prices. The decision feels like a shortcut but is actually an amputation.
The intuition registers this as a vague discomfort — a sense that something is missing — before the explicit realization arrives: if volume matters to the market, it has to matter to the model. Volume is not decoration. It is the confirmation signal. A 3% price move on double the average volume is categorically different from a 3% move on half the average volume. One is conviction. The other is a rumor.
The fix is conceptually simple but architecturally important: retain the full OHLCV data until feature engineering is complete. Only then is it acceptable to reduce.
The second architectural challenge is more subtle and more dangerous. Working with seven tickers simultaneously means the raw DataFrame has a multi-level column structure — (Price, Ticker). The naive approach would be to compute rolling averages and percentage changes across the entire DataFrame as if it were one continuous series. This is the data leakage problem, and it is silent.
If the data is not grouped by ticker before applying rolling windows, the 5-day moving average for MSFT on a given Monday might inadvertently include the Friday closes from AAPL's row if the sort order is wrong. Pandas will not raise an error. The numbers will look plausible. The model will train. But it will be learning corrupted features — features derived from an impossible information set that mixes companies, time periods, and return profiles that have no natural relationship.
The correct architecture is:
df = raw_data.stack(level=1).rename_axis(['Date', 'Ticker']).sort_index()
g = df.groupby('Ticker')
After this transformation, the DataFrame has a (Date, Ticker) MultiIndex. Every subsequent rolling calculation (moving averages, standard deviations, RSI, volume ratios) must route through g, the grouped object. This is not optional. It is the architectural invariant that makes everything downstream trustworthy.
The intuition frames this as a separation of concerns: calculations that describe a single stock's history belong to groupby('Ticker'), while calculations that describe a stock's position among its peers on a given day belong to groupby('Date'). These are two fundamentally different questions, and conflating them produces garbage.
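The silent nature of this leak is easier to see on a toy panel. The sketch below uses a synthetic two-ticker frame as a stand-in for the downloaded data (the prices, volumes, and the 2-day window are illustrative assumptions, not values from the project) and compares a naive rolling mean with one routed through the grouped object.

```python
import pandas as pd

# Synthetic stand-in for the downloaded frame: columns are (Price, Ticker).
dates = pd.bdate_range("2024-01-01", periods=6, name="Date")
cols = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAPL", "MSFT"]], names=["Price", "Ticker"]
)
raw_data = pd.DataFrame(0.0, index=dates, columns=cols)
raw_data[("Close", "AAPL")] = 100.0   # AAPL flat at 100 (illustrative)
raw_data[("Close", "MSFT")] = 400.0   # MSFT flat at 400 (illustrative)
raw_data[("Volume", "AAPL")] = 1e6
raw_data[("Volume", "MSFT")] = 2e6

# Long panel: one row per (Date, Ticker), Close and Volume both retained.
df = raw_data.stack(level=1).rename_axis(["Date", "Ticker"]).sort_index()
g = df.groupby(level="Ticker")

# Naive: the window slides over interleaved tickers and blends their prices.
naive_ma = df["Close"].rolling(2).mean()

# Correct: each ticker gets its own window.
ticker_ma = g["Close"].transform(lambda x: x.rolling(2).mean())

print(naive_ma.dropna().unique())   # a blended price neither stock traded at
print(ticker_ma.dropna().unique())  # each stock's own level
```

No error is raised in either case, which is exactly the danger: the naive column is numerically plausible and entirely meaningless.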
Once the architecture is stable, the feature engineering process begins — and it is here that the first real philosophical question emerges: what is a feature, really? At the surface level, a feature is a number fed to a model. But at a deeper level, a feature is a hypothesis. Each feature encodes a belief about the market.
- 1d_return encodes: yesterday's return contains information about today's return.
- dist_from_MA20 encodes: price deviation from a medium-term trend tends to revert.
- volume_avg_ratio encodes: unusual trading activity signals institutional conviction.
- RSI encodes: momentum has velocity, and velocity has limits.
The decision about which features to include is not purely empirical — it begins as a set of theoretical bets.
The first point of genuine confusion is the choice of rolling window sizes. MA(5), MA(10), MA(20) — the question surfaces naturally: why these numbers specifically? The answer is partly conventional and partly structural.
Five trading days equals one week. Twenty trading days equals roughly one calendar month. These are human-scaled intervals that correspond to natural rhythms in how institutions think about positions — weekly reviews, monthly rebalancing, quarterly earnings cycles.
But the deeper insight is that the relationship between windows carries signal, not just the windows themselves. A stock whose 5-day moving average is above its 20-day moving average is exhibiting short-term momentum that exceeds its medium-term average. This is the crossover signal, and it is more meaningful than either MA in isolation.
Rather than using the raw moving average values — which are non-stationary (they trend upward with the price) — the approach is to compute ratios:
df['dist_from_MA5'] = df['Close'] / df['sma_5']
df['dist_from_MA10'] = df['Close'] / g['Close'].transform(lambda x: x.rolling(10).mean())
df['dist_from_MA20'] = df['Close'] / df['sma_20']
A ratio centered around 1.0 is stationary. A raw moving average is not. This is the stationarity principle, and it matters because machine learning models trained on non-stationary features will underperform: they learn patterns that cannot repeat at different price levels.
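A minimal sketch of the ratio features, assuming a (Date, Ticker) panel with a Close column. The helper name add_ma_features is hypothetical, and expressing sma_crossover as the 5-day average divided by the 20-day average is an assumption; the text only implies that values above 1 mean an uptrend.

```python
import numpy as np
import pandas as pd

def add_ma_features(df: pd.DataFrame, windows=(5, 10, 20)) -> pd.DataFrame:
    """Per-ticker moving averages and price-to-MA ratios (hypothetical helper)."""
    out = df.copy()
    g = out.groupby(level="Ticker")["Close"]
    for w in windows:
        sma = g.transform(lambda x, w=w: x.rolling(w).mean())
        out[f"sma_{w}"] = sma
        out[f"dist_from_MA{w}"] = out["Close"] / sma
    # Assumption: crossover expressed as short MA over long MA.
    out["sma_crossover"] = out["sma_5"] / out["sma_20"]
    return out

# Two tickers with the same trend shape at very different price levels.
dates = pd.bdate_range("2024-01-01", periods=25, name="Date")
idx = pd.MultiIndex.from_product([dates, ["AAPL", "MSFT"]], names=["Date", "Ticker"])
trend = 1 + 0.01 * np.arange(25)
close = np.repeat(trend, 2) * np.tile([100.0, 400.0], 25)
panel = pd.DataFrame({"Close": close}, index=idx)

feats = add_ma_features(panel)
aapl = feats.xs("AAPL", level="Ticker")
msft = feats.xs("MSFT", level="Ticker")
```

In this toy panel the raw sma_20 columns differ by a factor of four, while the dist_from_MA20 columns are identical. The ratio describes the shape of the move, not the price level.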
The RSI calculation contains a subtle bug that persists through several iterations before being caught. The standard formula computes the Relative Strength ratio as gain / loss, then applies the RSI transformation 100 - (100 / (1 + rs)). An early version mistakenly multiplied by 100 inside the ratio calculation — rs = gain / loss * 100 — which stretched the input to the sigmoid-like formula and effectively made the RSI nearly always extreme, destroying its discriminatory power.
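For reference, a minimal sketch of the corrected calculation. The 14-period window and the simple rolling mean (rather than Wilder's smoothing) are assumptions, since the text does not say which variant the project's calculate_rsi uses. The last lines replay the bug on a single value of rs.

```python
import numpy as np
import pandas as pd

def calculate_rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI from simple rolling averages of gains and losses (assumed variant)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = delta.clip(upper=0).abs().rolling(window).mean()
    rs = gain / loss                  # no extra factor of 100 here
    return 100 - (100 / (1 + rs))

# A zigzag that gains 2 and loses 1 on alternating days: rs = 2.
zigzag = pd.Series(100 + np.concatenate([[0], np.cumsum([2, -1] * 10)]))
rsi_zigzag = calculate_rsi(zigzag).iloc[-1]

# A series that only rises has no losses, so RSI reaches its ceiling.
rising = pd.Series(np.arange(1.0, 21.0))
rsi_rising = calculate_rsi(rising).iloc[-1]

# The effect of the stray "* 100" on the same rs value.
rs = 2.0
correct = 100 - 100 / (1 + rs)
stretched = 100 - 100 / (1 + rs * 100)
print(round(correct, 2), round(stretched, 2))
```

A moderately bullish reading in the high 60s becomes a reading above 99 once the ratio is inflated, which is why the buggy feature looked extreme almost everywhere.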
More interesting, though, is the question of how RSI should be encoded for a classification model. The raw 0–100 scale contains meaningful absolute information: an RSI above 70 indicates overbought conditions; below 30 indicates oversold. But if this raw RSI is fed into a cross-sectional Z-score normalization across tickers on the same day, that absolute meaning evaporates. If every stock in the Magnificent 7 has an RSI of about 78 on the same day (a common occurrence in momentum markets), the Z-score subtracts that shared level away. Identical readings give a deviation of zero for every stock (strictly 0/0, which pandas returns as NaN, because the standard deviation is also zero), and near-identical readings are rescaled into ordinary-looking scores. Either way, the fact that the whole group is overbought is lost.
The eventual resolution is to treat RSI as an absolute feature rather than a cross-sectional relative feature. It is scaled independently using MinMaxScaler (later StandardScaler) fitted only on training data. But before that, it is binned:
binned = np.where(rsi < 30, 1, np.where(rsi > 70, -1, 0))
This discretization trades resolution for interpretability. Instead of a continuous 0–100 signal, the model receives a three-state indicator: oversold (+1), overbought (-1), neutral (0). The visual evidence from the equity curves suggests this binning actually improves model stability: it removes the "RSI tunnel vision" that was causing extreme negative weights and producing fragile behavior.
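A small runnable version of the binning, wrapped in a hypothetical helper bin_rsi so the edge cases are visible. The sample readings are illustrative.

```python
import numpy as np

def bin_rsi(rsi, low: float = 30, high: float = 70) -> np.ndarray:
    """Three-state RSI: +1 oversold, -1 overbought, 0 neutral."""
    rsi = np.asarray(rsi, dtype=float)
    return np.where(rsi < low, 1, np.where(rsi > high, -1, 0))

readings = [12.0, 30.0, 55.0, 70.0, 91.0, np.nan]
binned = bin_rsi(readings)
print(binned.tolist())
```

Two details are worth noticing: readings exactly at 30 or 70 fall in the neutral bin because the comparisons are strict, and a NaN (for example during the RSI warm-up window) is silently mapped to neutral because both comparisons evaluate to False.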
The realization that std_5 and std_10 together encode something that neither alone can capture comes from thinking about what the difference between them implies. If std_5 > std_10, recent volatility exceeds medium-term volatility — the market is becoming more erratic. If std_5 < std_10, the opposite: conditions are calming.
A bug persists for some time where std_10 is accidentally computed using a 5-day window:
df['std_10'] = g['1d_return'].transform(lambda x: x.rolling(5).std())  # BUG
This means both features are measuring identical things, providing zero additional information while consuming a feature slot and adding unnecessary correlation. The fix is a one-token change, rolling(5) to rolling(10), but the conceptual lesson is larger: every feature must earn its place by representing something the other features do not.
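One way to make this class of copy-paste bug harder to commit is to derive the column name and the window from the same loop variable. The sketch below is an assumption about how that might look, not the project's code; add_volatility_features is a hypothetical name and the returns are random.

```python
import numpy as np
import pandas as pd

def add_volatility_features(df: pd.DataFrame, windows=(5, 10)) -> pd.DataFrame:
    """Per-ticker rolling std of returns; name and window share one variable."""
    out = df.copy()
    g = out.groupby(level="Ticker")["1d_return"]
    for w in windows:
        out[f"std_{w}"] = g.transform(lambda x, w=w: x.rolling(w).std())
    return out

rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-01", periods=30, name="Date")
idx = pd.MultiIndex.from_product([dates, ["AAPL", "MSFT"]], names=["Date", "Ticker"])
panel = pd.DataFrame({"1d_return": rng.normal(0, 0.01, len(idx))}, index=idx)

feats = add_volatility_features(panel)
valid = feats.dropna()
print(np.allclose(valid["std_5"], valid["std_10"]))  # identical columns would print True
```

The warm-up length is a cheap extra check: a true 10-day window leaves nine missing values per ticker, a 5-day window only four.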
The volume features encode a different kind of market intelligence. volume_change captures the day-over-day shift in trading activity. volume_avg_ratio compares today's volume to a 20-day baseline.
The key distinction — understood gradually — is that volume does not cause price movements. It confirms them. A stock rising 4% on 0.5x average volume is a different signal than a stock rising 4% on 3x average volume. In the first case, few participants are involved — the move may lack conviction and could reverse. In the second, significant capital has moved — the signal is more likely to persist.
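A minimal sketch of the two volume features under assumed definitions: volume_change as the per-ticker day-over-day percentage change, and volume_avg_ratio as today's volume over a 20-day rolling mean that includes today. The text names the features but does not give formulas, so both definitions and the helper name are assumptions.

```python
import numpy as np
import pandas as pd

def add_volume_features(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Per-ticker volume change and volume-to-baseline ratio (assumed formulas)."""
    out = df.copy()
    g = out.groupby(level="Ticker")["Volume"]
    out["volume_change"] = g.pct_change()
    baseline = g.transform(lambda x: x.rolling(window).mean())
    out["volume_avg_ratio"] = out["Volume"] / baseline
    return out

# Twenty quiet days at 1M shares, then a 3M-share day.
dates = pd.bdate_range("2024-01-01", periods=21, name="Date")
idx = pd.MultiIndex.from_product([dates, ["NVDA"]], names=["Date", "Ticker"])
panel = pd.DataFrame({"Volume": np.r_[np.full(20, 1e6), 3e6]}, index=idx)

feats = add_volume_features(panel)
last = feats.iloc[-1]
print(last["volume_change"], round(last["volume_avg_ratio"], 3))
```

Because the spike day sits inside its own baseline under this definition, the ratio reads about 2.73 rather than 3.0. Shifting the baseline by one day would be the alternative if a pure "versus prior history" reading is wanted.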
One of the more creative feature decisions is the construction of rsi_trend:
df['rsi_trend'] = g['Close'].transform(calculate_rsi) * df['sma_crossover']
This multiplicative interaction encodes a conditional belief: RSI is only meaningful in the context of the prevailing trend. A binned RSI of -1 (overbought) during a confirmed uptrend (sma_crossover > 1) is a less reliable sell signal than the same RSI reading during a downtrend. The interaction feature captures this nuance in a way that the additive combination of separate features cannot.
The feature importance tables confirm that rsi_trend consistently becomes the most influential positive driver — in one iteration carrying a weight of 0.208, dwarfing all other signals. This is both encouraging and concerning. A model that depends heavily on a single interaction feature is a model with a concentrated bet. If the RSI-trend relationship breaks — as it can in certain regimes — the model's entire rationale collapses.
One of the most clarifying conceptual moments in the entire process is recognizing that the feature set contains two fundamentally different populations of information, each requiring a different normalization treatment.
Population 1: Cross-Sectional Relative Features
These features answer the question: how does this stock compare to its peers today? Returns, distance from moving averages, and volatility measures should all be Z-scored across tickers on each date:
df[features_to_scale] = df.groupby('Date')[features_to_scale].transform(
    lambda x: (x - x.mean()) / x.std()
)
This ensures that on any given day, the model sees which stock is the "most stretched from its MA" or "most volatile" relative to the others. This is cross-sectional alpha: the signal is in the relative ranking, not the absolute level.
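The behavior of the per-date Z-score is easiest to judge on a two-day toy panel. The numbers below are illustrative, chosen so that NVDA returns 3% on both days, first against a quiet group and then against a group that is up almost as much. The RSI column is held flat at 78 to show the degenerate case.

```python
import numpy as np
import pandas as pd

def cross_sectional_zscore(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Z-score each column across tickers within each date."""
    return df.groupby(level="Date")[cols].transform(
        lambda x: (x - x.mean()) / x.std()
    )

tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA"]
dates = pd.to_datetime(["2024-03-01", "2024-03-04"])
idx = pd.MultiIndex.from_product([dates, tickers], names=["Date", "Ticker"])
df = pd.DataFrame(
    {
        "1d_return": [0.001, 0.002, -0.001, 0.000, 0.030, 0.004, -0.002,
                      0.028, 0.027, 0.029, 0.026, 0.030, 0.031, 0.025],
        "RSI": [78.0] * 14,   # every stock overbought on both days
    },
    index=idx,
)

z = cross_sectional_zscore(df, ["1d_return", "RSI"])
nvda = z.xs("NVDA", level="Ticker")
print(nvda["1d_return"].round(2).tolist())
print(z["RSI"].isna().all())
```

The same 3% return scores far higher on the quiet day than on the day when the whole group rallied, which is the intended behavior. The flat RSI column comes back as NaN throughout, which is why absolute features need a different treatment.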
Population 2: Absolute Features
RSI, volume average ratio, and the SMA crossover have absolute meaning that should not be erased by cross-sectional normalization. RSI of 75 means something specific regardless of what the other six stocks' RSIs are.
These features are scaled using a scaler fitted only on training data, then applied to the test set without re-fitting. This is the critical discipline: the scaler must not "see" the future during training.
min_scaler = StandardScaler()
train_df[cols_to_minmax] = min_scaler.fit_transform(train_df[cols_to_minmax])
test_df[cols_to_minmax] = min_scaler.transform(test_df[cols_to_minmax])  # transform only
The asymmetry between fit_transform on training data and transform on test data is not just a technical formality; it is the enforcement of a temporal boundary. The model must live in its own past.
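A self-contained sketch of the same discipline on synthetic data. The split date, the column values, and the names abs_cols and scaler are illustrative assumptions; only the fit-on-train, transform-on-test pattern is taken from the text.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
dates = pd.bdate_range("2021-01-01", periods=400)
df = pd.DataFrame(
    {
        "volume_avg_ratio": rng.lognormal(0.0, 0.3, 400),
        "sma_crossover": 1 + rng.normal(0.0, 0.02, 400),
    },
    index=dates,
)

split_date = "2022-01-01"                       # hypothetical boundary
train_df = df.loc[df.index < split_date].copy()
test_df = df.loc[df.index >= split_date].copy()

abs_cols = ["volume_avg_ratio", "sma_crossover"]
scaler = StandardScaler()
train_df[abs_cols] = scaler.fit_transform(train_df[abs_cols])  # learns mean/std
test_df[abs_cols] = scaler.transform(test_df[abs_cols])        # reuses them

print(train_df[abs_cols].mean().round(6).tolist())
print(test_df[abs_cols].mean().round(3).tolist())
```

The training columns come out with mean zero by construction. The test columns generally do not, and that is the point: any drift in the test period stays visible instead of being normalized away with information from the future.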
The model tournament includes four competitors: Logistic Regression, Random Forest, Gradient Boosting (GBM), and Support Vector Machine. The intuition, shaped by general ML knowledge, might expect the more complex tree-based models to dominate. Financial data is noisy, non-linear, and regime-dependent — surely a Random Forest with its ensemble of decision boundaries would outperform a simple linear classifier?
The first equity curve comparison tells a different story.
Examining the chart from the initial multi-model run (the first screenshot, showing all four models from mid-2022 to early 2025):
- LR (blue) reaches approximately 2.5x initial value by early 2025
- GBM (green) and SVM (red) closely track LR but lag slightly
- RF (orange) significantly underperforms, topping out near 1.75x
The Random Forest's failure is diagnostic. Tree-based models are excellent at capturing non-linear relationships in stable data distributions, but they struggle with the "regime shift" problem — when the statistical relationship between features and outcomes changes fundamentally (as it did between the 2022 bear market and the 2023–2024 AI-driven bull market), a Random Forest that memorized the 2022 regime becomes actively harmful.
Logistic Regression, constrained by its linear decision boundary and L2 regularization, cannot overfit in this way. Its simplicity becomes its armor.
The second chart — the TCA version with 10bps costs applied — makes this even clearer. All models take significant hits: LR's Sharpe drops from 1.05 to 0.43, RF drops to 0.24. But LR still leads. The margin of alpha matters, and LR has more margin to absorb the friction.
The Occam's Razor principle surfaces here not as philosophy but as empirical observation: in noisy environments with shifting regimes, the model with the fewest degrees of freedom wins.
The initial model has a Recall of 0.988: it flags 98.8% of the days that actually turn out positive, which in practice means it recommends buying on nearly every available trading day. This is the "Permabull Problem": the model has learned that the market rises most of the time, so the safest bet is always to be long. This is not alpha. This is beta: market exposure dressed in algorithmic clothing.
The class weight adjustment attempts to cure this:
LogisticRegression(class_weight={0: 1.5, 1: 1.0})
By making "Down" days 50% more costly to misclassify, the intent is to force the model toward selectivity. The outcome is catastrophic in the opposite direction: the model produces exactly zero trades across the entire test period. The equity curve becomes a perfectly flat horizontal line at 1.0.
This overcorrection reveals something important about the geometry of the problem. The model's features do not contain enough discriminatory power to confidently distinguish most days. When penalized for errors in both directions, it retreats to the only safe position: never trade.
The resolution is a softer weight: {0: 1.2, 1: 1.0} — a 20% penalty premium for False Positives rather than 50%. This finds a working middle ground that reduces the trade count meaningfully without inducing paralysis.
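The cliff-edge between "always trade" and "never trade" can be reproduced on synthetic data. The sketch below is not the project's dataset: it uses pure-noise features and a fixed 53% share of "up" labels (both assumptions), so the only thing the model can learn is the base rate, and the class weight decides which side of 0.5 that base rate lands on.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 5))                    # features with no real signal
y = (np.arange(n) % 100 < 53).astype(int)      # 53% "up" days, unrelated to X

def count_trades(weight_down: float) -> int:
    """Number of rows with P(up) > 0.5 under a given penalty on class 0."""
    model = LogisticRegression(C=0.01, class_weight={0: weight_down, 1: 1.0})
    model.fit(X, y)
    return int((model.predict_proba(X)[:, 1] > 0.5).sum())

trades = {w: count_trades(w) for w in (1.0, 1.2, 1.5)}
print(trades)
```

With weak features, the class weight mostly shifts the intercept. A modest change moves the whole probability cloud across the 0.5 line, so the trade count swings between extremes instead of thinning out gradually.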
The most persistent and instructive challenge in the entire project is what might be called the Threshold Freeze — the phenomenon where changing the probability cutoff from 0.35 to 0.55 produces no change whatsoever in the number of trades, their precision, or any derived metric.
The first threshold chart makes this visually stark: seven distinct threshold values (0.50, 0.52, 0.54, 0.56, 0.58, 0.60, 0.62) produce equity curves that are perfectly superimposed, rendered in the same magenta color. They are indistinguishable because they are identical.
And then the probability distribution histogram delivers the diagnosis. The model's output probabilities are not spread across 0–1. They form a tight Gaussian bell curve centered at approximately 0.485, spanning roughly 0.44 to 0.54. The entire distribution occupies a 10-percentage-point window. No prediction ever achieves 0.60 probability. No prediction falls below 0.40. The thresholds being tested (0.35 to 0.55) either catch everything or catch nothing — there is no "middle ground" to filter.
This is what probability saturation looks like. The sigmoid function that converts Logistic Regression's linear output into a probability has been given weights so small (forced by strong L2 regularization at C=0.01) that it never departs far from 0.5. The model is maximally uncertain.
The concentration of probabilities around 0.5 is not a failure of the algorithm — it is an honest report of the features' limitations. The cross-sectional Z-scoring strips away magnitude. After scaling, all features oscillate within roughly the same range. The model's coefficients, already small due to L2 regularization, multiply these already-compressed features and produce linear outputs that hover near zero — which the sigmoid maps to approximately 0.5.
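A quick diagnostic makes the freeze measurable before any equity curve is drawn. The sketch below simulates probabilities in the band described above (centered near 0.485, confined to 0.44 to 0.54) and counts trades per threshold. The simulated distribution and the helper name are assumptions for illustration.

```python
import numpy as np

def trades_by_threshold(proba, thresholds) -> dict:
    """Count how many predictions clear each probability cutoff."""
    proba = np.asarray(proba)
    return {round(float(t), 2): int((proba > t).sum()) for t in thresholds}

rng = np.random.default_rng(7)
# Simulated model output: a narrow bell curve clipped to the observed band.
proba = np.clip(rng.normal(0.485, 0.015, size=4000), 0.44, 0.54)

outside = trades_by_threshold(proba, [0.35, 0.40, 0.55, 0.60])
inside = trades_by_threshold(proba, [0.46, 0.48, 0.50])
print(outside)
print(inside)
```

Thresholds outside the band select either everything or nothing, so sweeping them changes no metric. Only cutoffs placed inside the band can act as a lever, which suggests choosing thresholds from the quantiles of the predicted probabilities rather than from a fixed grid.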
The model is telling the truth: on most days, I cannot tell the difference between a winner and a loser.
This is actually a useful piece of information. It means any system of threshold-based filtering built on this model would require either (a) stronger features with wider natural ranges, or (b) probability calibration — a technique that "stretches" the output distribution to better reflect the true spectrum of model confidence.
The CalibratedClassifierCV experiment is attempted with the intention of stretching the probability distribution. The initial error — cv='prefit' is deprecated — is a version compatibility issue resolved by switching to cv=tscv.
But the calibrated model produces a worse outcome: trade count jumps from ~3,000 to 4,198, and Recall returns to 1.0. The calibration effectively amplified the model's existing bias rather than correcting it. Because the calibration was fitted on the test set (itself a methodological compromise), it learned to say "yes" to almost everything — an artifact of the test period being dominated by the Magnificent 7 bull run.
The lesson: calibration is not magic. A model that lacks discriminatory power in its raw predictions cannot acquire it through post-hoc probability adjustment. Calibration can improve communication of confidence, but it cannot manufacture confidence that doesn't exist.
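For reference, a minimal sketch of the cv=tscv form on synthetic data. The data, the weak signal, and the choice of sigmoid calibration are assumptions. Unlike the experiment described above, the calibrator here is fitted on training rows only.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 4))
y = (0.3 * X[:, 0] + rng.normal(size=1500) > 0).astype(int)  # weak signal

# Chronological split: first 1000 rows train, the rest held out.
X_train, y_train = X[:1000], y[:1000]
X_test = X[1000:]

tscv = TimeSeriesSplit(n_splits=5)
base = LogisticRegression(C=0.01)

# Each split fits a clone of the base model on the earlier rows
# and a calibrator on the later rows of that split.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=tscv)
calibrated.fit(X_train, y_train)

raw_proba = base.fit(X_train, y_train).predict_proba(X_test)[:, 1]
cal_proba = calibrated.predict_proba(X_test)[:, 1]
print(round(raw_proba.std(), 4), round(cal_proba.std(), 4))
```

Comparing the spread of raw and calibrated probabilities on held-out rows is a direct way to see whether calibration widened the distribution or merely relabeled the same uncertainty.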
One attempted remedy, switching from the per-date (cross-sectional) Z-scoring to a 20-day rolling Z-score for the cross-sectional features, produces the most extreme failure in the entire project. Trade count collapses to 5. F1 score drops to 0.0158. The best C parameter is 100, the weak-regularization end of the search, which suggests the grid search found almost no stable signal for regularization to protect.
The post-mortem analysis reveals the mechanism: rolling Z-scoring at a 20-day window destroys the magnitude of price movements. Within any 20-day window, a "large" move looks exactly like a "small" move relative to its local context. The model cannot distinguish a 3% breakout from a 0.1% drift because both are scaled to the same range within their respective windows. Combined with the harder median target, the model has been stripped of every piece of information it was using to make decisions.
Some simplifications reduce noise. Some simplifications eliminate signal. The rolling Z-score did the latter.
The rollback to per-date cross-sectional scaling restores function. This is an important empirical data point: the cross-sectional Z-score (grouping by date across tickers) works because it preserves relative magnitude within each day. NVDA's 3% return on a day when the group's average return is 0.1% is genuinely different from NVDA's 3% return on a day when the group's average is 2.8%. The first is clear outperformance; the second is barely distinguishable from the pack. The cross-sectional Z-score correctly encodes this distinction. The rolling Z-score does not.
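The way a rolling Z-score erases magnitude can be shown with two short return series that differ only in scale. The sketch assumes the rolling Z-score was computed along each ticker's own history, and it uses a 5-day window instead of 20 purely to keep the example short.

```python
import pandas as pd

def rolling_zscore(x: pd.Series, window: int) -> pd.Series:
    """Z-score of each value against its own trailing window."""
    r = x.rolling(window)
    return (x - r.mean()) / r.std()

# A calm week ending in a 0.1% drift...
calm = pd.Series([0.0005, -0.0005, 0.0005, -0.0005, 0.001])
# ...and the same pattern 30x larger, ending in a 3% breakout.
wild = calm * 30

z_calm = rolling_zscore(calm, 5).iloc[-1]
z_wild = rolling_zscore(wild, 5).iloc[-1]
print(round(z_calm, 4), round(z_wild, 4))
```

Both final values receive the same score, because the transformation is invariant to the scale of the window it is measured against. The model sees no difference between the drift and the breakout.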
For most of the project's development, the target variable is defined as:
df['y'] = (df['next_day_return'] > 0).astype(int)
This is the simplest reasonable formulation: predict whether tomorrow's return will be positive. But this target has a structural flaw in a cross-sectional context: in a bull market, the "correct" answer is 1 (positive) for the majority of stocks on the majority of days. A model that learns this base rate and simply always predicts 1 will achieve 52–53% precision without learning anything meaningful about stock selection.
The alternative formulation is sector- or group-relative outperformance:
df['y'] = (df['next_day_return'] > df['next_day_sector_mean']).astype(int)
Or, more directly, against the median of the group on the same date:
df['y'] = (df['next_day_return'] > df.groupby('Date')['next_day_return'].transform('median')).astype(int)
This reformulation changes the question from "will this stock go up?" to "will this stock beat that day's median?" The second question is harder and more useful. In a cross-sectional framework where the goal is to identify relative winners, sector-relative outperformance is the correct alpha target.
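A minimal sketch that builds both targets side by side, assuming a (Date, Ticker) panel with a 1d_return column. The helper name, the column names y_up and y_rel, and the synthetic returns are illustrative. The median is taken per date, so each day is judged only against its own cross-section.

```python
import numpy as np
import pandas as pd

def add_targets(df: pd.DataFrame) -> pd.DataFrame:
    """Absolute (up/down) and relative (beats same-day median) targets."""
    out = df.copy()
    # Tomorrow's return for the same ticker; the last date has none.
    out["next_day_return"] = out.groupby(level="Ticker")["1d_return"].shift(-1)
    out = out.dropna(subset=["next_day_return"])
    day_median = out.groupby(level="Date")["next_day_return"].transform("median")
    out["y_up"] = (out["next_day_return"] > 0).astype(int)
    out["y_rel"] = (out["next_day_return"] > day_median).astype(int)
    return out

tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA"]
dates = pd.bdate_range("2024-01-01", periods=4, name="Date")
idx = pd.MultiIndex.from_product([dates, tickers], names=["Date", "Ticker"])
# A one-way bull market: every return is positive, but they are not equal.
returns = np.tile(0.001 * np.arange(1, 8), 4) + np.repeat(0.0001 * np.arange(4), 7)
panel = pd.DataFrame({"1d_return": returns}, index=idx)

labeled = add_targets(panel)
print(labeled["y_up"].mean())
print(labeled.groupby(level="Date")["y_rel"].sum().tolist())
```

In this toy bull market the absolute target is 1 for every row, so it carries no selection information at all. The relative target marks exactly three of seven stocks per day, keeping the classes balanced regardless of market direction.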
When the median target is tested, the model's behavior changes: Best Score drops to 0.3017 (reflecting genuine difficulty), trade count falls to ~885, Recall drops to 0.216, and — critically — Max Drawdown improves to -0.139. The model is trading less but trading better. The equity curve, now showing a staircase-like stepped pattern due to fewer but more deliberate trades, tells a coherent story: the model waits for conviction and only acts when it has it.
The second chart in the sequence — showing the multi-model comparison with transaction costs applied — contains a feature that demands explanation: a sharp drawdown that bottoms in early January 2023, dipping to approximately 0.62–0.65 of starting value.
This dip is not a model failure in the pathological sense. It is the 2022 tech bear market. The Magnificent 7 entered a severe correction in the second half of 2022 driven by Federal Reserve rate hikes and multiple compression of growth stock valuations. NVDA lost approximately 65% of its value between November 2021 and October 2022. META dropped 75%. The entire group suffered.
The model, trained on historical patterns that include periods when RSI mean-reversion worked, did not have a mechanism to recognize that a fundamentally different regime had begun — one driven by macroeconomic forces rather than technical momentum. It continued buying dips that had no bottom. The equity curve reflects this regime blindness.
The recovery begins in early 2023 when the AI narrative (particularly around NVDA's data center business) begins to override the rate-hike overhang. The model catches the recovery and rides the subsequent bull run. But the shape of the drawdown is instructive: this model is not market-neutral. It carries full directional exposure to the Magnificent 7. In a sustained bear market, it will bleed.
The series of LR threshold equity curves — where all threshold lines overlap into what appears to be a single curve — are not boring. They are diagnostic. They show a model that has found one level of confidence it deems acceptable and applies it uniformly. The model does not have a spectrum of conviction; it has a binary state: "trade" or "don't trade."
The later versions of this chart (after switching to the median target and tuning C to 0.01) begin to show the stepped equity curve — a staircase rather than a smooth line. This is the model becoming more selective. It sits in cash for extended periods and then executes concentrated positions. The Calmar ratio improves because the drawdowns between stairs are smaller. This is progress.
The final chart, showing the optimized model with 880 trades, Precision 0.542, and MDD -0.110, tells the most mature story: a model that survives by selectivity rather than participation.
The two correlation matrix visualizations — one for the binned RSI version and one for the continuous RSI with rsi_trend — reveal the structural redundancy embedded in the feature set.
In the first matrix (binned RSI), the most striking pattern is the red cluster in the upper-left: 1d_return and vol_adj_return share a correlation of 0.86. 5d_return and dist_from_MA10 share 0.89. dist_from_MA20 and dist_from_MA10 share 0.83. This is a feature set where roughly half the features are describing variations of the same underlying signal: the price went up recently.
In the second matrix (continuous RSI with rsi_trend), the bottom-right corner reveals the problem that eventually motivates the binning decision: RSI and rsi_trend share a correlation of 0.998. They are for practical purposes the same number. Keeping both dilutes the weight that properly belongs to the interaction feature while providing zero additional information.
The correlation matrices also confirm which features carry genuinely independent information:
- volume_change has near-zero correlation with almost everything
- std_5 and std_10 have low correlation with price-level features (0.04–0.09)
- volume_avg_ratio is largely orthogonal to trend features
These are the features doing real work. They are the features that cannot be reconstructed from the others.
The key judgment call (can this level of multicollinearity be tolerated?) has a nuanced answer. Under L2 regularization (scikit-learn's default penalty for LogisticRegression, and the only penalty its default lbfgs solver supports), correlated features share weight rather than fight for it. The coefficients are smaller than they would be for independent features, but they are stable. A model with L1 regularization (Lasso-style) would tend to zero out redundant features outright. L2 retains them but quiets them.
Given the C=0.01 setting — strong regularization — the multicollinearity is largely absorbed. The weights remain coherent, as confirmed by the feature importance tables. But there is a hidden cost: the compressed, similar-range features contribute to the probability saturation problem. A more aggressively pruned feature set might produce better-differentiated probabilities.
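A common way to act on a correlation matrix is to drop the later member of any pair whose absolute correlation exceeds a cutoff. The sketch below is one such rule under stated assumptions: the 0.95 cutoff, the helper name, and the synthetic features are all illustrative, and column order decides which member of a pair survives.

```python
import numpy as np
import pandas as pd

def drop_correlated(features: pd.DataFrame, threshold: float = 0.95):
    """Drop any column that correlates above threshold with an earlier column."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return features.drop(columns=to_drop), to_drop

rng = np.random.default_rng(5)
rsi = rng.uniform(20, 80, 500)
features = pd.DataFrame(
    {
        # The interaction feature is listed first so that it is the survivor.
        "rsi_trend": rsi * (1 + rng.normal(0, 0.01, 500)),
        "RSI": rsi,
        "volume_change": rng.normal(0, 0.2, 500),
    }
)

pruned, dropped = drop_correlated(features)
print(dropped, list(pruned.columns))
```

Putting the preferred feature first is a deliberate choice here: the rule has no notion of which feature is more meaningful, so that judgment has to be encoded in the ordering.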
This is the unresolved tension at the heart of the project: more features provide richer information but create saturation; fewer features produce cleaner probabilities but risk underfitting.
One of the most clarifying moments in the entire project is the introduction of the Transaction Cost Analysis. A strategy that achieves a Sharpe of 1.05 without costs collapses to 0.43 with a 10bps (0.10%) round-trip cost assumption.
This is the "Death by a Thousand Cuts" problem in quantitative trading. A model that predicts tomorrow's direction correctly 53% of the time sounds impressive. But if it trades every day, paying 0.10% each time, the friction compounds. Over 252 trading days, the annual cost is approximately 25.2% of portfolio value. The model needs to generate more than 25.2% in annual returns just to break even with transaction costs — an enormous hurdle.
The evolution toward selectivity is directly motivated by this arithmetic. A model that trades 758 times in the test period (roughly 10% of available opportunities) pays far less in aggregate friction than one that trades 4,198 times. Even if the per-trade edge is similar, the aggregate friction is roughly 5.5x smaller (4,198 / 758 ≈ 5.5).
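The arithmetic behind these figures fits in a few lines. The sketch assumes a flat cost charged once per trade on the full portfolio, which is the simplification the text uses. It also shows the compounded drag, which is slightly lower than the simple sum because each cost is charged on an already reduced balance.

```python
def annual_cost_drag(trades_per_year: int, cost_per_trade: float):
    """Simple and compounded fraction of capital lost to trading costs."""
    simple = trades_per_year * cost_per_trade
    compounded = 1 - (1 - cost_per_trade) ** trades_per_year
    return simple, compounded

# Trading every day at 10 bps per round trip.
simple, compounded = annual_cost_drag(252, 0.0010)

# Relative friction of the calibrated model versus the selective one.
friction_ratio = 4198 / 758

print(round(simple, 4), round(compounded, 4), round(friction_ratio, 2))
```

Either way of counting puts the hurdle above 20% a year for a strategy that trades daily, which is why reducing the trade count matters more than a marginal gain in precision.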
The model that eventually achieves 880 trades, Precision 0.542, Sharpe 0.61 (with 20bps costs), and MDD -0.110 represents the best balance of selectivity and activity in the series. It still participates in the market's upward drift but avoids the daily friction tax of perpetual long exposure.
The most challenging self-examination comes when comparing the strategy's equity curve shape to a simple buy-and-hold of the Magnificent 7. The bear market dip and the bull market recovery are present in both. The algorithm's advantage is a slightly smoother path and a better Sharpe — but the story it tells is similar to the story the market tells.
This is the distinction between alpha and beta. Beta is market exposure: a portfolio that owns the market goes up when the market goes up. Alpha is excess return: outperforming the market through superior stock selection or timing. The current model exists in an ambiguous space. It is more selective than buy-and-hold (a Recall of 0.216 means it skips roughly 78% of the days that turn out to be winners), but the stocks it selects are the Magnificent 7 during a period when the Magnificent 7 broadly outperformed. It is difficult to disentangle model skill from sector tailwind.
The honest assessment: this model has not yet demonstrated genuine alpha. It has demonstrated survival — the ability to maintain positive equity with manageable drawdown while operating in a favorable macro environment. This is a meaningful achievement, but it is different from market-beating performance.
The calibration experiment fails primarily because it is applied to the wrong model configuration. A better application would involve: (1) training a base model on 2013–2021 data, (2) using 2022 data specifically as a validation set for calibration, and (3) evaluating on 2023–2024 holdout data. This temporal structure keeps the three periods disjoint and forces the calibrator to learn from a bear market, which is exactly the kind of distributional shift the model needs to understand.
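The three-block structure can be sketched without relying on any version-specific calibration API, by fitting a one-feature logistic regression on the base model's scores (a manual Platt-style step). Everything below is an assumption for illustration: the synthetic data, the block boundaries standing in for the three periods, and the near-unregularized calibrator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 3000
X = rng.normal(size=(n, 4))
y = (0.5 * X[:, 0] + rng.normal(size=n) > 0).astype(int)

# Hypothetical chronological blocks: train / calibrate / evaluate.
train, calib, test = slice(0, 2000), slice(2000, 2500), slice(2500, n)

# Stage 1: base model sees only the training block.
base = LogisticRegression(C=0.01).fit(X[train], y[train])

# Stage 2: Platt-style calibrator sees only the calibration block.
calib_scores = base.decision_function(X[calib]).reshape(-1, 1)
platt = LogisticRegression(C=1e6).fit(calib_scores, y[calib])

# Stage 3: both are frozen and applied to the untouched holdout.
test_scores = base.decision_function(X[test]).reshape(-1, 1)
raw_proba = base.predict_proba(X[test])[:, 1]
cal_proba = platt.predict_proba(test_scores)[:, 1]

order = np.argsort(raw_proba)
print(bool(np.all(np.diff(cal_proba[order]) >= 0)))
```

The final check confirms that this kind of calibration is a monotone rescaling: it can change how wide the probabilities are, but not which stock ranks above which. Any ranking skill has to come from the base model.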
One of the most promising undeveloped ideas is making the model sector-neutral by simultaneously going long the highest-probability stock and short the lowest-probability stock within the same group. This construction eliminates market beta entirely — if the whole Magnificent 7 drops 3%, the long-short position is roughly flat. The alpha lies purely in the relative performance of the long versus the short.
This is how professional cross-sectional models are typically deployed. The current implementation only takes long positions, which leaves the model fully exposed to directional market risk.
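A minimal sketch of the long-short construction, assuming wide frames of predicted probabilities and next-day returns (dates as rows, tickers as columns) and equal capital on each leg. The numbers are invented, and financing and borrow costs are ignored.

```python
import numpy as np
import pandas as pd

def long_short_returns(proba: pd.DataFrame, next_ret: pd.DataFrame) -> pd.Series:
    """Daily return of long top-probability stock, short bottom-probability stock."""
    long_t = proba.idxmax(axis=1)
    short_t = proba.idxmin(axis=1)
    rows = np.arange(len(proba))
    r = next_ret.to_numpy()
    long_r = r[rows, next_ret.columns.get_indexer(long_t)]
    short_r = r[rows, next_ret.columns.get_indexer(short_t)]
    return pd.Series(long_r - short_r, index=proba.index, name="long_short")

dates = pd.bdate_range("2024-01-01", periods=3, name="Date")
tickers = ["AAPL", "NVDA", "TSLA"]
proba = pd.DataFrame(
    [[0.49, 0.52, 0.47], [0.51, 0.48, 0.50], [0.46, 0.50, 0.53]],
    index=dates, columns=tickers,
)
next_ret = pd.DataFrame(
    [[0.002, 0.010, -0.004], [0.006, -0.003, 0.001], [-0.002, 0.004, 0.009]],
    index=dates, columns=tickers,
)

base = long_short_returns(proba, next_ret)
# The same days with a uniform 3% market drop applied to every stock.
shocked = long_short_returns(proba, next_ret - 0.03)
print(base.round(3).tolist(), shocked.round(3).tolist())
```

The shocked series is identical to the base series: a move shared by every stock cancels between the two legs, leaving only the spread between the model's favorite and its least favorite.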
The project ends before a proper out-of-time test is conducted on assets outside the Magnificent 7. Testing the trained model on, say, mid-cap technology stocks or healthcare names would reveal whether the signals learned are specific to mega-cap growth momentum or represent a more general behavioral pattern. If the model fails on different assets, the "alpha" is largely a reflection of the Magnificent 7's particular volatility and momentum characteristics rather than universal principles.
The model that emerges from this iterative process has the following characteristics:
Data: the seven Magnificent 7 stocks, 2013–2024, full OHLCV, stacked into (Date, Ticker) MultiIndex panel format.
Target: next_day_return > 0 — a binary directional prediction — chosen for its simplicity and interpretability, despite acknowledging the superior theoretical properties of a sector-relative target.
Features:
- Cross-sectional Z-scored: 1d_return, vol_adj_return, 5d_return, dist_from_MA20, dist_from_MA10, dist_from_MA5, std_5, std_10, volume_change
- StandardScaler-normalized (train only): volume_avg_ratio, sma_crossover, rsi_trend (binned RSI × sma_crossover), high_vol_regime
Model: Logistic Regression with L2 penalty, C=0.01, class_weight={0: 1.2, 1: 1.0}, scored by F1, cross-validated with TimeSeriesSplit(n_splits=5).
Trading rule: Enter long at open when predict_proba > 0.50 (no threshold optimization, since probabilities cluster and thresholds are effectively non-functional as levers).
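The configuration above can be sketched as follows. The synthetic data and the grid of C values are assumptions (the text reports only the selected C), and the penalty argument is left at its default, which is L2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(1)
X = rng.normal(size=(1200, 6))                     # stand-in feature matrix
y = (0.2 * X[:, 0] + rng.normal(size=1200) > 0).astype(int)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}  # hypothetical grid

search = GridSearchCV(
    LogisticRegression(class_weight={0: 1.2, 1: 1.0}, max_iter=1000),
    param_grid,
    scoring="f1",
    cv=TimeSeriesSplit(n_splits=5),   # every fold validates on later rows
)
search.fit(X, y)

# Trading rule: go long wherever the predicted probability exceeds 0.50.
signals = search.best_estimator_.predict_proba(X)[:, 1] > 0.50
print(search.best_params_, int(signals.sum()))
```

TimeSeriesSplit matters here for the same reason as the scaler discipline: an ordinary shuffled K-fold would let the model validate on days that precede its training data.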
| Metric | Value |
|---|---|
| Precision | 0.542 |
| Recall | 0.215 |
| Sharpe Ratio | 0.61 |
| Sortino Ratio | 0.44 |
| Profit Factor | 1.30 |
| Calmar Ratio | 1.27 |
| Max Drawdown | -0.110 |
| Trades (test period) | ~880 |
The equity curve tells a story of intermittent conviction. The model sits idle for long stretches, then executes clusters of trades when the RSI-trend interaction and volume signals align. The stepped nature of the equity curve reflects this selectivity.
The satisfaction here is not triumphant. It is the quiet, grounded satisfaction of having built something that works under specific conditions while maintaining clarity about those conditions' limitations.
The model is not ready for live deployment. Not because it is broken, but because several crucial gaps remain:
- The threshold freeze is not resolved — the model lacks the probability granularity to use confidence as a position-sizing lever.
- The Magnificent 7 context is a favorable one — the model has not been tested in a bear market where its mean-reversion assumptions would face sustained violations.
- Beta exposure is unhedged — without a short leg, the model participates in every broad market decline.
- The out-of-time test on different asset classes has not been conducted.
What is achieved is a proof of mechanism: the pipeline from raw OHLCV data through feature engineering, scaling, model training, and backtesting evaluation works correctly and produces coherent signals. The cross-sectional architecture is sound. The train/test scaler discipline is maintained. The financial metrics are computed correctly.
This is the foundation on which something deployable could eventually be built.
1. Probability Calibration via Holdout Regime
The calibration approach failed when applied naively. A proper two-stage calibration (base model trained on 2013–2021, calibrated on 2022's bear market, evaluated on 2023–2024) would provide the distributional range the sigmoid currently lacks. The bear market contains the negative-return regime the model must learn to respect.
2. Long-Short Construction
Converting from long-only to long-short would neutralize beta and transform the performance metric from "did I beat zero?" to "did I beat the group?" This is the correct frame for a cross-sectional model. Implementation requires taking the top quintile of predicted probabilities long and bottom quintile short.
3. Feature Pruning
The correlation matrix clearly shows that several features are redundant. A more aggressive pruning — removing dist_from_MA5 (correlated 0.81 with dist_from_MA10) and 5d_return (correlated 0.89 with dist_from_MA10) — would reduce the feature space and potentially allow the model's weights to become more concentrated and its probabilities more differentiated.
4. Ensemble or Stacking
The four models tested behaved differently in different regimes (LR outperformed in the bull market; RF lagged due to overfitting on 2022). An ensemble that blends their predictions, weighted by their rolling recent performance, might be more robust across regime transitions.
5. Genuine Out-of-Time Validation
The most important next step is testing the saved model (vector_alpha_lr_model.pkl) on a completely different universe — perhaps S&P 500 mid-caps or European large-caps. If the signal transfers, the model has found something real. If it fails, the model has found something about the Magnificent 7 specifically. Both outcomes are useful, but only the former justifies deployment.
6. Position Sizing as a Function of Confidence
The threshold freeze problem reveals a deeper issue: the model cannot size positions differentially because it cannot rank its own confidence. A more sophisticated approach would involve generating a probability score, sorting all available stocks by that score each morning, and allocating capital proportionally to rank rather than using a binary cutoff. This is the natural extension of a cross-sectional framework.
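The rank-based sizing in the last item can be sketched in a few lines. The linear rank-to-weight mapping, the helper name, and the probabilities are assumptions. The point is that ranks remain usable even when the probabilities themselves sit in a narrow band.

```python
import numpy as np
import pandas as pd

def rank_weights(proba: pd.DataFrame) -> pd.DataFrame:
    """Per-date long-only weights proportional to each stock's probability rank."""
    ranks = proba.rank(axis=1, method="first")   # 1 = lowest score
    return ranks.div(ranks.sum(axis=1), axis=0)

tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA"]
dates = pd.bdate_range("2024-01-01", periods=2, name="Date")
# Tightly clustered scores, as in the saturated model.
proba = pd.DataFrame(
    [[0.481, 0.486, 0.479, 0.490, 0.502, 0.484, 0.476],
     [0.492, 0.478, 0.485, 0.480, 0.488, 0.501, 0.483]],
    index=dates, columns=tickers,
)

weights = rank_weights(proba)
print(weights.round(3))
```

With seven stocks the top-ranked name receives 7/28 of capital and the bottom-ranked 1/28, whatever the gap between their probabilities. Whether a spread of a few thousandths deserves that much differentiation is a separate question that the ranking alone cannot answer.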
This documentation represents the honest record of a learning process — not a finished system but a rigorous exploration. The market does not reward sincerity, but it does eventually punish self-deception. The discipline applied here — maintaining temporal boundaries, separating training from testing, grounding every design decision in financial logic rather than technical elegance — is the foundation of anything that could eventually survive.
Model artifacts produced: vector_alpha_lr_model.pkl
Codebase structure:
- load_data.py: data acquisition and caching
- feature_engineering.py: feature computation and scaling pipeline
- model.py: GridSearchCV, threshold analysis, evaluation
- correlation_analysis.py: multicollinearity diagnosis