Mastering Linear Regression: From Scratch with Python and NumPy

Breye Foka

🏁 Introduction: The Quest for the Best Fit Line

Linear Regression is often the starting point for anyone journeying into machine learning and statistics. At its core, it solves a fundamental problem: given a set of data points representing pairs of inputs (X) and outputs (Y), how can we find the straight line that best describes the relationship between them?

Our goal is to build a model that can predict a continuous target variable Y based on an independent variable X. While libraries like Scikit-Learn do this in one line, building it from scratch using NumPy ensures a solid grasp of the underlying mathematics.

📐 The Mathematical Model

We assume that the relationship between our variables can be expressed by a linear equation:

y = \beta_0 + \beta_1 x + \epsilon

Where:

  • y is the actual target value.
  • x is the input feature.
  • \beta_0 is the intercept (where the line crosses the y-axis).
  • \beta_1 is the slope (how much y changes for every unit change in x).
  • \epsilon represents the irreducible error or noise in real-world data.

Our model attempts to learn estimates of these parameters, denoted as \hat{\beta}_0 and \hat{\beta}_1, to make predictions:

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

🎯 The Objective: Defining "Error"

How do we define the "best" line? We need a way to quantify how "wrong" our line is. We do this by calculating the residual (e_i) for every data point—the vertical distance between the actual point and our predicted line:

e_i = y_i - \hat{y}_i

To measure the total error of the model, we cannot simply sum the errors, as positives and negatives would cancel out. Instead, we sum the squares of the errors. This is known as the Sum of Squared Residuals (SSR) or the Cost Function S:

S(\beta_0, \beta_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2

Our entire goal is to find the specific values of \beta_0 and \beta_1 that result in the smallest possible value for S.
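To make this concrete, here is a minimal sketch of the cost S for two candidate lines on toy data (the numbers are invented purely for illustration):

```python
import numpy as np

# Toy data that roughly follows y = 2 + 3x with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])

def ssr(beta_0, beta_1, x, y):
    """Sum of squared residuals S(beta_0, beta_1)."""
    residuals = y - (beta_0 + beta_1 * x)
    return np.sum(residuals ** 2)

# A line close to the true relationship scores far lower than a poor guess
print(ssr(2.0, 3.0, x, y))  # small
print(ssr(0.0, 1.0, x, y))  # much larger
```

Minimizing S means searching over all (\beta_0, \beta_1) pairs for the one with the lowest score; the two methods below do exactly that, one analytically and one iteratively.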

⚙️ Method 1: Ordinary Least Squares (OLS) — The Closed-Form Solution

Ordinary Least Squares is a method for finding the exact mathematical solution that minimizes the error function S. Because our cost function is smooth and convex (a paraboloid in \beta_0 and \beta_1), we can use calculus to find the minimum point by taking the partial derivatives and setting them to zero.

📉 Deriving the OLS Formulas

We need to minimize S with respect to both parameters.

1. Derivative with respect to the intercept (\beta_0):

Using the chain rule:

\frac{\partial S}{\partial \beta_0} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-1) = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)

Set to 0 and simplify:

\sum y_i - n\beta_0 - \beta_1 \sum x_i = 0

Dividing by n gives us the means \bar{y} and \bar{x}, leading to our first key formula:

\beta_0 = \bar{y} - \beta_1 \bar{x}

2. Derivative with respect to the slope (\beta_1):

\frac{\partial S}{\partial \beta_1} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-x_i) = -2 \sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i)

Set to 0 and substitute the \beta_0 equation we just found. After significant algebraic rearrangement, we arrive at the final formula for the slope:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

This formula is essentially the covariance of X and Y divided by the variance of X.
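That covariance-over-variance reading is easy to verify numerically. The sketch below (data invented for illustration) computes the slope both ways; note that the ddof choice cancels in the ratio as long as it matches:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 6.0, 8.5, 10.0])

# Slope as covariance(x, y) / variance(x)
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0 = np.mean(y) - beta_1 * np.mean(x)

# Identical to the explicit centered-sums formula derived above
beta_1_sums = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(beta_1, beta_1_sums)  # the two slopes agree
```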

💻 OLS NumPy Implementation

Because OLS provides a closed-form solution, there is no training loop. We just plug in the data.

import numpy as np

class OLSLinearRegression:
    def fit(self, X, y):
        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Calculate terms for the slope (beta_1) formula
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)

        # Calculate slope and intercept
        self.beta_1 = numerator / denominator
        self.beta_0 = y_mean - (self.beta_1 * x_mean)

    def predict(self, X):
        return self.beta_0 + self.beta_1 * X

🔄 Method 2: Gradient Descent — The Iterative Approach

While OLS is exact, it can be computationally expensive for massive datasets with many features (due to matrix inversion requirements in multivariate cases). Gradient Descent is an iterative optimization algorithm used to find the minimum of a function, widely used throughout deep learning.

Instead of solving it in one go, we start with random parameters and take small steps downhill in the direction of the steepest descent.

We often use the Mean Squared Error (MSE) as our cost function J here, usually divided by 2 to make the derivative cleaner:

J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2

📉 Calculating Gradients

We need the partial derivatives of the cost function J to know which way is "downhill":

\frac{\partial J}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)

\frac{\partial J}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i
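A standard way to gain confidence in derivatives like these is a finite-difference check: perturb each parameter slightly and compare the numerical slope of J against the analytic gradient. A minimal sketch, using toy data and an arbitrary evaluation point:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
n = len(x)

def cost(b0, b1):
    # J(b0, b1) = (1/2n) * sum((y_hat - y)^2)
    return np.sum((b0 + b1 * x - y) ** 2) / (2 * n)

def grads(b0, b1):
    # The analytic gradients derived above
    err = (b0 + b1 * x) - y
    return np.sum(err) / n, np.sum(err * x) / n

# Central finite differences should recover the same values
b0, b1, h = 0.5, 0.5, 1e-6
num_d0 = (cost(b0 + h, b1) - cost(b0 - h, b1)) / (2 * h)
num_d1 = (cost(b0, b1 + h) - cost(b0, b1 - h)) / (2 * h)
an_d0, an_d1 = grads(b0, b1)
print(num_d0 - an_d0, num_d1 - an_d1)  # both differences are tiny
```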

📦 Parameter Update Rule

We update parameters simultaneously using a learning rate (\alpha), which controls how big a step we take:

\beta_0 := \beta_0 - \alpha \frac{\partial J}{\partial \beta_0}

\beta_1 := \beta_1 - \alpha \frac{\partial J}{\partial \beta_1}

💻 Gradient Descent NumPy Implementation

class GDLinearRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.beta_0 = 0
        self.beta_1 = 0

    def fit(self, X, y):
        n_samples = len(X)

        for _ in range(self.iterations):
            # 1. Forward pass (make predictions)
            y_predicted = self.beta_0 + self.beta_1 * X

            # 2. Calculate Gradients
            # (Using vectorization for efficiency instead of loops)
            d_beta1 = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            d_beta0 = (1 / n_samples) * np.sum(y_predicted - y)

            # 3. Update Parameters
            self.beta_1 -= self.lr * d_beta1
            self.beta_0 -= self.lr * d_beta0

    def predict(self, X):
        return self.beta_0 + self.beta_1 * X

📚 Literature and Context

The roots of linear regression run deep into statistical history. The method of least squares was first published by Adrien-Marie Legendre in 1805, though Carl Friedrich Gauss claimed to have used it earlier to predict the location of the dwarf planet Ceres.

The term "regression" itself comes from Francis Galton in the 19th century. In studying heredity, he observed that sons of tall fathers tended to be taller than average, but shorter than their fathers—a phenomenon he called "regression towards mediocrity" (now regression to the mean).

Statistically, under the assumption that the error terms ϵ\epsilon are normally distributed, the OLS estimator is also the Maximum Likelihood Estimator (MLE). Furthermore, the Gauss-Markov theorem states that OLS estimators are the Best Linear Unbiased Estimators (BLUE), meaning they have the lowest variance among all unbiased linear estimators.

🧠 Final Thoughts

We have explored two distinct ways to arrive at the same result:

  • OLS: Utilizing calculus to derive a precise, closed-form mathematical solution. It is fast for smaller datasets.
  • Gradient Descent: Utilizing iterative optimization to creep down the error landscape. This is the foundational concept for training complex neural networks.

By implementing both from scratch, we move beyond viewing machine learning models as "black boxes" and gain an appreciation for the elegant mathematics that drives predictions.

📌 Next Steps

  • Expand the implementation to Multiple Linear Regression (using matrix notation Y = X\beta + \epsilon) to handle multiple input features.
  • Implement Polynomial Regression to capture non-linear relationships.
  • Add regularization terms (L1 Lasso and L2 Ridge) to prevent overfitting.
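As a preview of the first item, here is a sketch of multiple linear regression using NumPy's built-in least-squares solver (which avoids explicitly inverting X^T X); the feature count, coefficients, and seed below are invented for illustration:

```python
import numpy as np

# Synthetic data: y = 0.5 + 1.0*x1 - 2.0*x2 + noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # two features
y = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

# A column of ones absorbs the intercept into the coefficient vector
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # roughly [0.5, 1.0, -2.0]
```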