Bridging pure mathematics and machine learning: implementing Linear Regression from first principles using nothing but NumPy and the mathematics of Linear Algebra.
This project implements Linear Regression from scratch using core Linear Algebra concepts, as part of Linear Algebra for Computer Science (Math 204) at the Faculty of Computer & Information Sciences.
Rather than using black-box ML libraries, every algorithm is derived and implemented mathematically, then validated against industry-standard tools to prove correctness.
Dataset: Canadian Vehicle CO₂ Emissions (~7,385 vehicles)
Goal: Predict CO₂ emissions (g/km) from engine characteristics
Result: R² = 0.7345, predictions within ~30 g/km of actual values
The exact solution to Linear Regression is derived by minimizing the least squares cost:
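Written out, with X the feature matrix (including the bias column of 1s) and y the target vector, the least squares cost and its closed-form minimizer are:

$$
J(\mathbf{w}) = \lVert X\mathbf{w} - \mathbf{y} \rVert^2,
\qquad
\mathbf{w}^{*} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}
$$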
- One-shot: solves directly with no iterations
- Exact: gives the globally optimal solution
- Limitation: matrix inversion scales as O(n³), expensive for large datasets
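A minimal sketch of this one-shot fit, assuming X already carries the bias column (the function name is illustrative, not the project's actual API in src/model.py):

```python
import numpy as np

def fit_normal_equation(X, y):
    """One-shot least squares: solve (XᵀX) w = Xᵀy for w."""
    # Solving the linear system is preferred over forming an explicit inverse
    # (see the numerical-stability note later in this README).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Usage: w = fit_normal_equation(X_train, y_train)
```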
Instead of inverting a matrix, we follow the gradient of the cost function downhill:
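In the mean squared error convention (one common choice; the project's exact scaling may differ), the update rule with learning rate α over n training samples is:

$$
\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla J(\mathbf{w}),
\qquad
\nabla J(\mathbf{w}) = \frac{2}{n} X^{\top} (X\mathbf{w} - \mathbf{y})
$$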
- Iterative: converges over 1000 steps
- Scalable: works efficiently on massive datasets
- Key insight: arrives at the same weights as the Normal Equation
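A minimal batch gradient descent sketch under the same assumptions (standardized features, bias column appended); the learning rate and iteration count shown are illustrative defaults, not necessarily those used in the project:

```python
import numpy as np

def fit_gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the mean squared error cost."""
    n_samples = X.shape[0]
    w = np.zeros(X.shape[1])
    cost_history = []
    for _ in range(n_iters):
        residual = X @ w - y                          # shape: (n_samples,)
        w -= lr * (2.0 / n_samples) * (X.T @ residual)
        cost_history.append(np.mean(residual ** 2))   # cost before this update
    return w, cost_history
```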
Ridge Regression prevents overfitting by adding a penalty term to the cost function.
The identity matrix I is modified so the bias term is never penalized (the diagonal entry corresponding to the bias column is set to 0), preserving the intercept while shrinking the feature weights.
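Written out, with λ the regularization strength and Ĩ denoting the identity matrix with its bias-column diagonal entry zeroed (the tilde is notation introduced here, not the project's):

$$
J(\mathbf{w}) = \lVert X\mathbf{w} - \mathbf{y} \rVert^2 + \lambda\, \mathbf{w}^{\top} \tilde{I}\, \mathbf{w},
\qquad
\mathbf{w}^{*} = (X^{\top} X + \lambda \tilde{I})^{-1} X^{\top} \mathbf{y}
$$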
| Method | MSE | R² Score | Notes |
|---|---|---|---|
| Normal Equation (ours) | ~913 | 0.7345 | One-shot exact solution |
| Gradient Descent (ours) | ~913 | 0.7345 | Converges in ~100 iterations |
| Ridge Regression (λ = 10) | ~913 | 0.7345 | Slightly shrunk weights |
| scikit-learn (reference) | ~913 | 0.7345 | Industry standard |
Weight difference vs scikit-learn: 3.16e-13, essentially zero, confirming mathematical accuracy.
| Predicted (g/km) | Actual (g/km) | Error (g/km) |
|---|---|---|
| 250.8 | 253.0 | 2.2 |
| 304.5 | 344.0 | 39.5 |
| 347.7 | 322.0 | 25.7 |
The cost function drops from ~33,000 to ~447 in the first 100 iterations, then plateaus, confirming convergence to the optimal solution.
Convergence plot (cost on the y-axis, iterations 0 to 1000 on the x-axis): see plots/gradient_descent_convergence.png.
Linear_Algebra_Project/
│
├── 📁 src/
│   ├── model.py                  # Normal Equation + Gradient Descent + Ridge
│   ├── utils.py                  # Data loading, feature selection, normalization
│   └── comparison.py             # Three-way comparison with sklearn
│
├── 📁 data/
│   └── CO2_Emissions.csv         # Canadian vehicle emissions dataset
│
├── 📁 plots/
│   └── gradient_descent_convergence.png
│
├── 📁 notebooks/
│   └── exploration.ipynb         # Data exploration & visualization
│
├── main.py                       # Full pipeline: runs all 4 tasks
├── requirements.txt
└── README.md
Two features were deliberately chosen rather than using every available column:
| Feature | Reason |
|---|---|
| Engine Size (L) | Strong physical relationship with fuel burn |
| Cylinders | Structural engine complexity indicator |
Fuel consumption columns were excluded: they directly encode CO₂ (CO₂ ∝ fuel burn), which would make the regression trivially easy and mathematically uninteresting.
- Load CSV and drop rows with null values
- Select meaningful features explicitly
- Normalize with `StandardScaler` (critical for Gradient Descent convergence)
- Add a bias column (column of 1s) for the intercept term
- 80/20 train/test split
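A minimal sketch of this preprocessing pipeline. The CSV column names and the random seed are assumptions rather than values taken from the project, and the scaler is fit on the training split only, which is the usual precaution:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Column names are assumptions about the CSV header; adjust to the actual file.
FEATURES = ["Engine Size(L)", "Cylinders"]
TARGET = "CO2 Emissions(g/km)"

df = pd.read_csv("data/CO2_Emissions.csv").dropna()
X = df[FEATURES].to_numpy(dtype=float)
y = df[TARGET].to_numpy(dtype=float)

# 80/20 split, then standardize features (fit on the training set only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Prepend the bias column of 1s for the intercept term
X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
X_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
```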
`np.linalg.solve(XᵀX, Xᵀy)` is used instead of `np.linalg.inv(XᵀX) @ Xᵀy`: solving the linear system directly is more numerically stable than explicitly computing the matrix inverse.
git clone https://github.com/AhmedMohammedRo/Linear_Algebra_Project
cd Linear_Algebra_Project
pip install -r requirements.txt
python main.py

This will:
- Train all three models (Normal Equation, Gradient Descent, Ridge)
- Display the convergence plot
- Print the full comparison table in the terminal
- Show sample predictions vs actual values
jupyter notebook notebooks/exploration.ipynb

Dependencies (from requirements.txt):

- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
1. Two paths, one destination
Normal Equation and Gradient Descent both arrive at identical weights (difference < 1e-10), confirming that the least squares cost function is convex and has exactly one global minimum; a quick self-contained check is sketched after these takeaways.
2. Normalization is not optional
Without StandardScaler, Gradient Descent either diverges or needs thousands more iterations. Feature scaling is what makes the cost surface well-conditioned (close to spherical) and easy to navigate.
3. Regularization is a linear algebra operation
Ridge regression adds λI to XᵀX before inversion; this tiny change guarantees the matrix is invertible even when features are correlated, and shrinks the weights to prevent overfitting.
4. Our implementation matches sklearn to 13 decimal places
This validates that the mathematics was implemented correctly with no shortcuts.
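A quick, self-contained check of the first takeaway on synthetic data (not the project's dataset; all names and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])  # bias + 2 features
y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(scale=0.1, size=200)

# Path 1: closed-form Normal Equation
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Path 2: batch gradient descent on the MSE cost
w_gd = np.zeros(3)
for _ in range(5000):
    w_gd -= 0.01 * (2.0 / len(y)) * (X.T @ (X @ w_gd - y))

print("max |difference|:", np.max(np.abs(w_normal - w_gd)))  # effectively zero
```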
| Name |
|---|
| Omar Shaker |
| Ahmad Roshdy |
| Mark Tamer |
| Khalid Osam |
| Carlos Emad |
| Ahmad Fouad |
| Yousef Hany |
| Mohammad Elsayed |
| Course | Linear Algebra for Computer Science (Math 204) |
|---|---|
| Level | First-Year Undergraduate |
| Instructor | Dr. Doaa Elsakout |
| Academic Year | 2025 / 2026 |
| Deliverable | 15-minute group presentation |
| Weight | 10% of final grade |
Built with mathematics, not magic.