Mastering Linear Regression: From Scratch with Python and NumPy

🏁 Introduction: The Quest for the Best Fit Line
Linear Regression is often the starting point for anyone journeying into machine learning and statistics. At its core, it solves a fundamental problem: given a set of data points representing pairs of inputs ($x$) and outputs ($y$), how can we find the straight line that best describes the relationship between them?
Our goal is to build a model that can predict a continuous target variable $y$ based on an independent variable $x$. While libraries like Scikit-Learn do this in one line, building it from scratch using NumPy ensures a solid grasp of the underlying mathematics.
📐 The Mathematical Model
We assume that the relationship between our variables can be expressed by a linear equation:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:
- $y$ is the actual target value.
- $x$ is the input feature.
- $\beta_0$ is the intercept (where the line crosses the y-axis).
- $\beta_1$ is the slope (how much $y$ changes for every unit change in $x$).
- $\epsilon$ represents the irreducible error or noise in real-world data.

Our model attempts to learn estimates of these parameters, denoted as $\hat{\beta}_0$ and $\hat{\beta}_1$, to make predictions:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
🎯 The Objective: Defining "Error"
How do we define the "best" line? We need a way to quantify how "wrong" our line is. We do this by calculating the residual ($e_i$) for every data point, the vertical distance between the actual point and our predicted line:

$$e_i = y_i - \hat{y}_i$$

To measure the total error of the model, we cannot simply sum the residuals, as positives and negatives would cancel out. Instead, we sum the squares of the residuals. This is known as the Sum of Squared Residuals (SSR) or the Cost Function $J$:

$$J(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$

Our entire goal is to find the specific values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that result in the smallest possible value for $J$.
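Before deriving anything, it helps to see the cost function in action. The sketch below (with small hypothetical toy data invented for illustration) computes the SSR for two candidate lines, showing that a line close to the data scores far lower than a bad guess:

```python
import numpy as np

# Hypothetical toy data: five points lying near the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def ssr(beta_0, beta_1, x, y):
    """Sum of squared residuals for a candidate line."""
    residuals = y - (beta_0 + beta_1 * x)
    return np.sum(residuals ** 2)

print(ssr(1.0, 2.0, x, y))  # near-optimal line: small error
print(ssr(0.0, 0.0, x, y))  # flat line at zero: huge error
```

Minimizing this quantity over all possible intercept/slope pairs is exactly the job of the two methods that follow.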
⚙️ Method 1: Ordinary Least Squares (OLS) — The Closed-Form Solution
Ordinary Least Squares is a method used to find the exact mathematical solution that minimizes the error function $J$. Because our cost function is a smooth, convex (bowl-shaped) surface in $\hat{\beta}_0$ and $\hat{\beta}_1$, we can use calculus to find the minimum point by taking the partial derivatives and setting them to zero.
📉 Deriving the OLS Formulas
We need to minimize $J$ with respect to both parameters.

1. Derivative with respect to the intercept ($\hat{\beta}_0$):

Using the chain rule:

$$\frac{\partial J}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)$$

Set to 0 and simplify:

$$\sum_{i=1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i$$

Dividing by $n$ gives us means ($\bar{x}$ and $\bar{y}$), leading to our first key formula:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

2. Derivative with respect to the slope ($\hat{\beta}_1$):

$$\frac{\partial J}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)$$

Set this to 0 and substitute the intercept equation we just found. After significant algebraic rearrangement, we arrive at the final formula for the slope:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

This formula is essentially the covariance of $x$ and $y$ divided by the variance of $x$.
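The covariance/variance claim is easy to verify numerically. The sketch below (on hypothetical data) computes the slope both ways; note `ddof=0` so that `np.cov` uses the population convention, making the shared $1/n$ factors cancel exactly:

```python
import numpy as np

# Hypothetical data to check the covariance/variance identity
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Slope via the OLS formula derived above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sum((x - x.mean()) ** 2)
beta_1 = num / den

# Same slope via covariance(x, y) / variance(x)
beta_1_cov = np.cov(x, y, ddof=0)[0, 1] / np.var(x)

print(beta_1, beta_1_cov)  # identical values
```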
💻 OLS NumPy Implementation
Because OLS provides a closed-form solution, there is no training loop. We just plug in the data.
```python
import numpy as np

class OLSLinearRegression:
    def fit(self, X, y):
        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Calculate terms for the slope (beta_1) formula
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)

        # Calculate slope and intercept
        self.beta_1 = numerator / denominator
        self.beta_0 = y_mean - (self.beta_1 * x_mean)

    def predict(self, X):
        return self.beta_0 + self.beta_1 * X
```
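A quick sanity check: fit the class on synthetic data generated from a known line and confirm the recovered parameters land near the true values. The snippet below re-defines the class so it runs standalone; the data-generating line ($y = 3x + 4$) and noise level are arbitrary choices for illustration:

```python
import numpy as np

class OLSLinearRegression:
    """Closed-form simple linear regression, as defined above."""
    def fit(self, X, y):
        x_mean, y_mean = np.mean(X), np.mean(y)
        self.beta_1 = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
        self.beta_0 = y_mean - self.beta_1 * x_mean

    def predict(self, X):
        return self.beta_0 + self.beta_1 * X

# Synthetic data from y = 3x + 4 plus a little Gaussian noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
y = 4 + 3 * X + rng.normal(0, 0.5, size=X.shape)

model = OLSLinearRegression()
model.fit(X, y)
print(model.beta_0, model.beta_1)  # should land close to 4 and 3
```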
🔄 Method 2: Gradient Descent — The Iterative Approach
While OLS is exact, it can be computationally expensive for massive datasets with many features (due to matrix inversion requirements in multivariate cases). Gradient Descent is an iterative optimization algorithm used to find the minimum of a function, widely used throughout deep learning.
Instead of solving it in one go, we start with random parameters and take small steps downhill in the direction of the steepest descent.
We often use the Mean Squared Error (MSE) as our cost function here, usually divided by 2 to make the derivative cleaner:

$$J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2$$
📉 Calculating Gradients
We need the partial derivatives of the cost function to know which way is "downhill":

$$\frac{\partial J}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) \qquad \frac{\partial J}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right) x_i$$
📦 Parameter Update Rule
We update parameters simultaneously using a learning rate ($\alpha$), which controls how big a step we take:

$$\beta_0 := \beta_0 - \alpha \frac{\partial J}{\partial \beta_0} \qquad \beta_1 := \beta_1 - \alpha \frac{\partial J}{\partial \beta_1}$$
💻 Gradient Descent NumPy Implementation
```python
class GDLinearRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.beta_0 = 0.0
        self.beta_1 = 0.0

    def fit(self, X, y):
        n_samples = len(X)
        for _ in range(self.iterations):
            # 1. Forward pass (make predictions)
            y_predicted = self.beta_0 + self.beta_1 * X

            # 2. Calculate gradients
            # (Using vectorization for efficiency instead of loops)
            d_beta1 = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            d_beta0 = (1 / n_samples) * np.sum(y_predicted - y)

            # 3. Update parameters
            self.beta_1 -= self.lr * d_beta1
            self.beta_0 -= self.lr * d_beta0

    def predict(self, X):
        return self.beta_0 + self.beta_1 * X
```
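To confirm the iterative method converges to the same answer as the closed form, the sketch below re-defines the class so it runs standalone and fits noise-free data. The learning rate and iteration count here are illustrative guesses; appropriate values depend on the scale of `X`, which is exactly the kind of tuning OLS never requires:

```python
import numpy as np

class GDLinearRegression:
    """Gradient-descent simple linear regression, as defined above."""
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.beta_0 = 0.0
        self.beta_1 = 0.0

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.iterations):
            y_pred = self.beta_0 + self.beta_1 * X
            d_beta1 = (1 / n) * np.dot(X, y_pred - y)
            d_beta0 = (1 / n) * np.sum(y_pred - y)
            self.beta_1 -= self.lr * d_beta1
            self.beta_0 -= self.lr * d_beta0

# Noise-free data from y = 3x + 4, so the minimum is exactly (4, 3)
X = np.linspace(0, 2, 40)
y = 4 + 3 * X

model = GDLinearRegression(learning_rate=0.05, iterations=5000)
model.fit(X, y)
print(model.beta_0, model.beta_1)  # should approach 4 and 3
```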
📚 Literature and Context
The roots of linear regression run deep into statistical history. The method of least squares was first published by Adrien-Marie Legendre in 1805, though Carl Friedrich Gauss claimed to have used it earlier to predict the location of the dwarf planet Ceres.
The term "regression" itself comes from Francis Galton in the 19th century. In studying heredity, he observed that sons of tall fathers tended to be taller than average, but shorter than their fathers—a phenomenon he called "regression towards mediocrity" (now regression to the mean).
Statistically, under the assumption that the error terms are normally distributed, the OLS estimator is also the Maximum Likelihood Estimator (MLE). Furthermore, the Gauss-Markov theorem states that OLS estimators are the Best Linear Unbiased Estimators (BLUE), meaning they have the lowest variance among all unbiased linear estimators.
🧠 Final Thoughts
We have explored two distinct ways to arrive at the same result:
- OLS: Utilizing calculus to derive a precise, closed-form mathematical solution. It is fast for smaller datasets.
- Gradient Descent: Utilizing iterative optimization to creep down the error landscape. This is the foundational concept for training complex neural networks.
By implementing both from scratch, we move beyond viewing machine learning models as "black boxes" and gain an appreciation for the elegant mathematics that drives predictions.
📌 Next Steps
- Expand the implementation to Multiple Linear Regression (using matrix notation, $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$) to handle multiple input features.
- Implement Polynomial Regression to capture non-linear relationships.
- Add regularization terms (L1 Lasso and L2 Ridge) to prevent overfitting.
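As a starting point for the first of these next steps, a minimal multi-feature sketch might solve the normal equations $X^T X \boldsymbol{\beta} = X^T \mathbf{y}$ directly. The data here is hypothetical and noise-free, chosen so the recovered coefficients are unambiguous:

```python
import numpy as np

# Hypothetical two-feature data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]

# Prepend a column of ones so the intercept is learned as beta[0]
X_b = np.hstack([np.ones((len(X), 1)), X])

# Solve the normal equations X^T X beta = X^T y
# (np.linalg.lstsq is the more numerically robust choice in practice)
beta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print(beta)  # approximately [1, 2, -3]
```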