619,041 trading records across 500 companies and 11 sectors from 2013 to 2018. I wanted to understand which sectors actually delivered, where the risk was hiding, and whether paying a premium valuation was worth it.
I built this to practice writing real SQL on a dataset large enough to actually stress-test the queries, and to think through sector-level market analysis the way a finance team would approach it rather than just calculating basic returns and calling it done.
The dataset covers daily stock prices for S&P 500 companies from early 2013 through early 2018, joined with a fundamentals table containing P/E ratios, market cap, EBITDA, dividend yield, and earnings per share. Putting those two together made it possible to connect price performance with underlying financial health rather than looking at either in isolation.
Three things I wanted to figure out:
What sectors generated the strongest returns and at what level of risk? Did expensive valuations predict better performance? And where was investor attention actually going over time, as measured by trading volume?
Dashboard: Tableau Public
Dataset 1: S&P 500 Historical Stock Data
Dataset 2: S&P 500 Companies with Financial Information
MySQL Workbench for all the data work. Tableau Public for the dashboard. No Python, no additional libraries. Just SQL and visualization.
SQL techniques used across the 10 queries: JOIN, GROUP BY, subqueries, CTEs, chained CTEs, RANK() OVER PARTITION BY, LAG(), rolling averages with ROWS BETWEEN, UNBOUNDED PRECEDING, CASE WHEN with scalar subqueries, temporary tables, query optimization, STR_TO_DATE() for type conversion, and backtick escaping for column names with special characters.
The stock price table has 619,041 rows of daily data. The fundamentals table has 505 company records. They join on ticker symbol.
One thing worth flagging upfront: the dataset runs from February 2013 to February 2018, not full calendar years. 2013 and 2018 each have around 40 trading days compared to roughly 252 for complete years like 2014 through 2016. This matters for two specific analyses.
Volume figures for 2013 and 2018 are not comparable to the full years. The roughly 87% year-over-year volume decline that appears in 2018 is not a real market event. It is a data artifact from the partial year. I kept those figures in rather than filtering them out, noted the limitation clearly in the dashboard, and documented it here so anyone reading this understands what they are looking at.
Five-year returns are calculated using the first available price in early February 2013 against the last available price in early February 2018. It is approximately five years, not exactly five calendar years.
One other thing to flag in the fundamentals data: several column names have spaces and slashes in them, Price/Earnings, Market Cap, Dividend Yield, and others. Every query that touches those columns wraps them in backticks. It is a small detail but without it every query throws a syntax error, and it is the kind of thing that only shows up when you actually run the queries against real data.
Query 1 calculates average closing price by sector and year. This is the base of everything else. The date column came in as text rather than a date type, so every query that involves dates uses STR_TO_DATE to convert it first. Information Technology went from $52.50 average in 2013 to $128.70 in 2018. Telecom went from $39.70 to $35.80, the only sector that ended lower than it started.
Query 2 calculates the five-year price return per sector using first and last available prices. It uses nested subqueries to find the earliest and latest trading dates per ticker and retrieves the closing price on each of those days. The limitation here is that point-in-time prices can be sensitive to single-day anomalies. A cleaner approach would compare annual averages from Query 1, which smooths out daily noise. Both approaches tell a similar story directionally.
Query 3 measures volatility as average daily price range divided by closing price, expressed as a percentage. Dividing by price is important because an absolute dollar range means something very different for a $10 stock versus a $200 stock. Energy came in at 2.72% average daily volatility, nearly double Consumer Staples at 1.53%.
Query 4 looks at annual trading volume by sector. It reports both total volume and average daily volume per company because total volume can be misleading if one sector simply has more companies than another.
Query 5 is the financial health scorecard using the fundamentals table only. No join with stock prices needed here. Market cap and EBITDA figures are divided by a billion to make them readable. Raw numbers like 97,140,000,000 tell you nothing at a glance.
One finding worth explaining: Telecom shows an average market cap of $151 billion and average EBITDA of $33.7 billion, both dramatically higher than other sectors. This is accurate but requires context. Telecom in this dataset has only three companies: AT&T with $226 billion in market cap, Verizon with $208 billion, and CenturyLink with $18 billion. Two of the three are among the largest companies in the entire S&P 500. When you average across three companies skewed that heavily, the result stops being representative of a typical Telecom company. It is a good example of why sector averages need to be read alongside company count, which is why I included that field.
Query 6 ranks the top five performing stocks within each sector by five-year return. The first version of this query timed out after 30 seconds. MySQL was scanning all 619,041 rows multiple times through deeply nested subqueries. The fix was a temporary table that pre-calculates the first and last price for every ticker once, then the main query reads from that smaller result. The ranking itself uses RANK() OVER (PARTITION BY Sector ORDER BY return DESC), which resets the rank counter for each sector independently so every sector gets its own top five rather than one global list.
The numbers that came out of this one were striking. NVDA delivered 1,750% over the period. This was before Nvidia became a household name tied to AI. The data covers the years when GPU computing was beginning to underpin machine learning infrastructure and the market was starting to price that in, though not yet at the scale that happened after 2023. NFLX came in at 923%, covering the period when Netflix completed its transformation from a DVD service into a streaming platform. STZ, Constellation Brands, a beer and wine company, delivered 572% in Consumer Staples after acquiring the US rights to Corona. The best returns often show up in unexpected places.
Query 7 calculates year-over-year volume change using LAG(). LAG looks back one row within a window. PARTITION BY Sector ensures it only looks back within the same sector, so 2015 Financials compares against 2014 Financials and not 2014 Energy. The first year returns NULL from LAG since there is no prior year to compare against, filtered out with WHERE Prev_Year_Volume IS NOT NULL. The 2018 figures show large declines for the partial year reason described above.
Query 8 builds a 30-day rolling average price per sector using AVG() OVER with ROWS BETWEEN 29 PRECEDING AND CURRENT ROW. The window covers the current day plus the 29 days before it, always exactly 30 trading days. I used ROWS BETWEEN rather than RANGE BETWEEN because ROWS operates on physical row positions, which is more precise when trading dates have gaps from weekends and holidays. Thirty trading days is roughly six calendar weeks, which is a standard smoothing window in financial analysis.
Query 9 calculates maximum drawdown per sector using three chained CTEs. The first gets daily sector prices. The second tracks the running historical peak using MAX() OVER with UNBOUNDED PRECEDING, meaning it goes all the way back to the first row in the partition to find the true historical maximum at any given point in time. The third measures how far below that peak each day sits. Since drawdowns are always zero or negative, MIN() gives the most negative value, which is the worst decline from peak. Energy showed the deepest drawdown, consistent with the oil price collapse visible in the heatmap where Energy peaked in 2014 and declined through 2016.
Query 10 tests whether sectors with high P/E ratios outperformed sectors with lower valuations. A scalar subquery inside CASE WHEN calculates the overall average P/E across all sectors and classifies each sector as High Valuation or Low Valuation relative to that benchmark. The classification is dynamic rather than using an arbitrary threshold. Two CTEs join at the final SELECT: one for five-year returns, one for valuation classification. The result shows whether paying a premium for growth sectors was justified by actual performance over this specific period.
| Technique | Queries |
|---|---|
| JOIN across tables | All queries |
| STR_TO_DATE type conversion | 1, 2, 3, 4, 6, 7, 8, 9 |
| GROUP BY with aggregates | 1, 2, 3, 4, 5 |
| Subqueries | 2, 6, 10 |
| CTEs | 6, 7, 8, 9, 10 |
| Chained CTEs | 7, 8, 9, 10 |
| RANK() OVER PARTITION BY | 6, 10 |
| LAG() window function | 7 |
| Rolling window ROWS BETWEEN | 8 |
| UNBOUNDED PRECEDING | 9 |
| CASE WHEN | 10 |
| Scalar subquery in CASE WHEN | 10 |
| Temporary tables | 6 |
| Query optimization | 6 |
| Backtick escaping | 5, 10 |
Two dashboards published as a Tableau Story.
The first covers sector performance: a dual axis line chart showing price trends and company count together, a lollipop chart showing five-year returns by sector, and a horizontal bar chart for volatility. The dual axis on the price trend chart came from wanting to show both price appreciation and the number of companies contributing to each sector average at the same time, since sectors with more companies produce more stable averages.
The second dashboard goes deeper: a stacked area chart for trading volume over time with a note about the partial year data, a financial health scorecard with P/E, EPS, dividend yield, market cap, and EBITDA side by side, and a horizontal bar chart showing the top five stocks per sector with an interactive filter so you can drill into any sector.
The sector filter on Dashboard 2 updates all three charts simultaneously.
Information Technology led five-year returns from the lowest starting price of any sector, which is why percentage return tells a different story than absolute price level. Health Care had higher average stock prices throughout the period but lower percentage growth because it started from a higher base.
Energy was the worst sector by any risk-adjusted measure. Highest volatility, near-zero returns. The drawdown analysis shows the oil price collapse clearly with a peak in 2014 that was never recovered within the dataset period.
Individual stock returns bore little resemblance to sector averages. NVDA at 1,750% and NFLX at 923% are so far above their sector averages that the sector figure stops being useful for understanding what actually happened. That is part of why the top stocks query exists.
The Telecom metrics need the context explained earlier to be interpreted correctly. Numbers are accurate, but three companies including two of America's largest make the sector averages unrepresentative of a typical Telecom business.
Query 10 on valuation versus performance is the one I find most interesting from a finance perspective because it puts empirical data against a debate that usually stays theoretical.
The dataset covers February 2013 to February 2018 with partial years at both ends. Five-year returns use point-in-time prices rather than annual averages. The fundamentals table is a single snapshot rather than a time series, so P/E and market cap figures reflect one moment rather than the full five-year period. Sector composition varies significantly from 3 companies in Telecom to 81 in Consumer Discretionary, which affects how much sector averages actually mean. Query 9 drawdown uses sector-level daily averages rather than individual stock drawdowns, which smooths out single-stock extreme events.
sp500-sector-performance-analysis/
│
├── data/
│ ├── all_stocks_5yr.csv
│ ├── sp500_financials.csv
│ ├── sector_annual_returns.csv
│ ├── sector_5yr_return.csv
│ ├── sector_volatility.csv
│ ├── sector_volume_yoy.csv
│ ├── sector_financial_scorecard.csv
│ └── top_stocks_per_sector.csv
│
├── sql/
│ └── sp500_analysis_ZeelShah.sql
│
├── dashboard/
│ └── SP500_Analysis_ZeelShah.twbx
│
└── README.md
Zeel Shah. Finance background with an MSBA from UC Davis. CFA Level II candidate. Currently looking for financial analyst and data analyst roles in the Bay Area where I can put both sides of this kind of work to use.