
📉 Linear Regression from Scratch

Linear Algebra for Computer Science - Project 23


Bridging pure mathematics and machine learning: implementing Linear Regression from first principles using nothing but NumPy and the mathematics of Linear Algebra.


🎯 Project Overview

This project implements Linear Regression from scratch using core Linear Algebra concepts, as part of Linear Algebra for Computer Science (Math 204) at the Faculty of Computer & Information Sciences.

Rather than using black-box ML libraries, every algorithm is derived and implemented mathematically, then validated against industry-standard tools to prove correctness.

Dataset: Canadian Vehicle CO₂ Emissions (~7,385 vehicles)
Goal: Predict CO₂ emissions (g/km) from engine characteristics
Result: R² = 0.7345, predictions within ~30 g/km of actual values


🧠 The Mathematics

Normal Equation: Closed-Form Solution

The exact solution to Linear Regression is derived by minimizing the least squares cost:

$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

  • One-shot: solves directly, with no iterations
  • Exact: gives the globally optimal solution
  • Limitation: matrix inversion scales as O(n³), expensive for large datasets
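A minimal NumPy sketch of this closed form (the repository's src/model.py may organize it differently):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: w = (X^T X)^{-1} X^T y."""
    # Solve the linear system rather than inverting X^T X explicitly
    # (see "Numerical Stability" below).
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy example: y = 1 + 2x, with a bias column of ones prepended
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([5.0, 7.0, 11.0])
print(normal_equation(X, y))  # -> [1. 2.]  (intercept 1, slope 2)
```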

Gradient Descent: Iterative Optimization

Instead of inverting a matrix, we follow the gradient of the cost function downhill:

$$J(\mathbf{w}) = \frac{1}{2m} \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2$$

$$\mathbf{w} := \mathbf{w} - \alpha \cdot \frac{1}{m} \mathbf{X}^\top (\mathbf{X}\mathbf{w} - \mathbf{y})$$

  • Iterative: runs for 1,000 steps, effectively converging within the first ~100
  • Scalable: works efficiently on massive datasets
  • Key insight: arrives at the same weights as the Normal Equation
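The update rule in the same sketch style (the learning rate and iteration count here are illustrative defaults, not necessarily the project's settings):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Minimize J(w) = (1/2m) * ||Xw - y||^2 by batch gradient descent."""
    m, n = X.shape
    w = np.zeros(n)
    cost_history = []
    for _ in range(iters):
        residual = X @ w - y                   # Xw - y, shape (m,)
        cost_history.append(residual @ residual / (2 * m))
        w = w - alpha * (X.T @ residual) / m   # w := w - (alpha/m) X^T (Xw - y)
    return w, cost_history
```

On the toy data above this approaches the same weights as the closed form; a cost_history like this one is what the convergence plot under Results visualizes.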

Ridge Regression: Regularization

Prevents overfitting by adding a penalty term to the cost function:

$$\mathbf{w} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}$$

The identity matrix I is modified so the bias term is never penalized (I₀₀ = 0), preserving the intercept while shrinking the feature weights.
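A sketch of that solution, assuming (as in the pipeline below) that the bias column of ones is column 0:

```python
import numpy as np

def ridge_regression(X, y, lam=10.0):
    """Ridge: w = (X^T X + lam * I)^{-1} X^T y, leaving the bias unpenalized."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0  # I_00 = 0: don't shrink the intercept
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)
```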


📊 Results

Method Comparison

| Method | MSE | R² Score | Notes |
|--------|-----|----------|-------|
| Normal Equation (ours) | ~913 | 0.7345 | One-shot exact solution |
| Gradient Descent (ours) | ~913 | 0.7345 | Converges in ~100 iterations |
| Ridge Regression (λ = 10) | ~913 | 0.7345 | Slightly shrunk weights |
| scikit-learn (reference) | ~913 | 0.7345 | Industry standard |

Weight difference vs scikit-learn: 3.16e-13 (essentially zero), confirming mathematical accuracy.
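The kind of check behind that number, sketched here on synthetic data (the repository's comparison.py presumably runs it on the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(200, 2))
y = 3.0 + 1.5 * X_feat[:, 0] - 2.0 * X_feat[:, 1] + rng.normal(scale=0.1, size=200)

# Ours: Normal Equation with a bias column of ones prepended
X = np.column_stack([np.ones(len(y)), X_feat])
w_ours = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: scikit-learn fits the intercept internally
ref = LinearRegression().fit(X_feat, y)
w_sklearn = np.concatenate(([ref.intercept_], ref.coef_))

print(np.max(np.abs(w_ours - w_sklearn)))  # on the order of 1e-14 here
```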

Sample Predictions

| Predicted (g/km) | Actual (g/km) | Error |
|------------------|---------------|-------|
| 250.8 | 253.0 | 2.2 |
| 304.5 | 344.0 | 39.5 |
| 347.7 | 322.0 | 25.7 |

Gradient Descent Convergence

The cost function drops from ~33,000 to ~447 in the first 100 iterations, then plateaus, confirming convergence to the optimal solution.

Cost
33000 |█
      |█
      | █
      |  ██
 1000 |    ████████████████████████████████
  447 |                                    ─────────────
      └─────────────────────────────────────────────────
      0        200       400       600       800      1000
                              Iterations
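A few lines of matplotlib reproduce this kind of plot, reusing the gradient_descent sketch above (cost values will differ on the toy data):

```python
import matplotlib.pyplot as plt

# cost_history as returned by gradient_descent() above
_, cost_history = gradient_descent(X, y, alpha=0.1, iters=1000)

plt.plot(cost_history)
plt.xlabel("Iterations")
plt.ylabel("Cost J(w)")
plt.title("Gradient Descent Convergence")
plt.show()
```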

πŸ—οΈ Project Structure

Linear_Algebra_Project/
│
├── 📂 src/
│   ├── model.py          # Normal Equation + Gradient Descent + Ridge
│   ├── utils.py          # Data loading, feature selection, normalization
│   └── comparison.py     # Three-way comparison with sklearn
│
├── 📂 data/
│   └── CO2_Emissions.csv # Canadian vehicle emissions dataset
│
├── 📂 plots/
│   └── gradient_descent_convergence.png
│
├── 📂 notebooks/
│   └── exploration.ipynb # Data exploration & visualization
│
├── main.py               # Full pipeline - runs all 4 tasks
├── requirements.txt
└── README.md

βš™οΈ Implementation Details

Feature Engineering

Two features were deliberately selected from the full feature set:

| Feature | Reason |
|---------|--------|
| Engine Size (L) | Strong physical relationship with fuel burn |
| Cylinders | Structural indicator of engine complexity |

Fuel consumption columns were excluded: they directly encode CO₂ (CO₂ ∝ fuel burn), which would make the regression trivially easy and mathematically uninteresting.

Data Pipeline

  1. Load CSV → drop null rows
  2. Select meaningful features explicitly
  3. Normalize with StandardScaler (critical for Gradient Descent convergence)
  4. Add bias column (column of 1s) for the intercept term
  5. 80/20 train/test split
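A sketch of the pipeline in that order; the column names and random seed are assumptions based on the public Canadian CO₂ emissions dataset and may not match data/CO2_Emissions.csv exactly:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1-2. Load, drop nulls, select features explicitly (column names assumed)
df = pd.read_csv("data/CO2_Emissions.csv").dropna()
X_feat = df[["Engine Size(L)", "Cylinders"]].to_numpy(dtype=float)
y = df["CO2 Emissions(g/km)"].to_numpy(dtype=float)

# 3. Normalize (critical for Gradient Descent convergence)
X_scaled = StandardScaler().fit_transform(X_feat)

# 4. Bias column of ones for the intercept term
X = np.column_stack([np.ones(len(X_scaled)), X_scaled])

# 5. 80/20 train/test split (random_state is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```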

Numerical Stability

np.linalg.solve(XᵀX, Xᵀy) is used instead of np.linalg.inv(XᵀX) @ Xᵀy: solving the linear system directly is more numerically stable than explicitly computing the matrix inverse.
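A quick illustration on random data; both agree here, but solve degrades far more gracefully as XᵀX becomes ill-conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)
XtX, Xty = X.T @ X, X.T @ y

w_solve = np.linalg.solve(XtX, Xty)  # preferred: solve the system directly
w_inv = np.linalg.inv(XtX) @ Xty     # avoided: explicit inverse amplifies rounding error

print(np.max(np.abs(w_solve - w_inv)))  # tiny here; grows with the condition number of XtX
```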


🚀 How to Run

1. Clone the repository

git clone https://github.com/AhmedMohammedRo/Linear_Algebra_Project
cd Linear_Algebra_Project

2. Install dependencies

pip install -r requirements.txt

3. Run the full pipeline

python main.py

This will:

  • Train all three models (Normal Equation, Gradient Descent, Ridge)
  • Display the convergence plot
  • Print the full comparison table in the terminal
  • Show sample predictions vs actual values

4. Explore the notebook (optional)

jupyter notebook notebooks/exploration.ipynb

📦 Dependencies

numpy
pandas
matplotlib
seaborn
scikit-learn

🔑 Key Takeaways

1. Two paths, one destination
Normal Equation and Gradient Descent both arrive at identical weights (diff < 1e-10), confirming that the least squares cost function is convex and has exactly one global minimum, which both methods find.

2. Normalization is not optional
Without StandardScaler, Gradient Descent either diverges or needs thousands more iterations. Feature scaling makes the cost surface nearly spherical, so the negative gradient points almost directly at the minimum.

3. Regularization is a linear algebra operation
Ridge regression adds λI to XᵀX before inversion. This tiny change guarantees the matrix is invertible even when features are correlated, and shrinks the weights to prevent overfitting.

4. Our implementation matches sklearn to 13 decimal places
This validates that the mathematics was implemented correctly with no shortcuts.


👥 Team

| Name |
|------|
| Omar Shaker |
| Ahmad Roshdy |
| Mark Tamer |
| Khalid Osam |
| Carlos Emad |
| Ahmad Fouad |
| Yousef Hany |
| Mohammad Elsayed |

📚 Course Information

| Course | Linear Algebra for Computer Science (Math 204) |
|---------------|------------------------------------------------|
| Level | First-Year Undergraduate |
| Instructor | Dr. Doaa Elsakout |
| Academic Year | 2025 / 2026 |
| Deliverable | 15-minute group presentation |
| Weight | 10% of final grade |

Built with mathematics, not magic.
