Mastering Linear Regression: From Scratch with Python and NumPy

Breye Foka

🏁 Introduction: The Quest for the Best Fit Line

Linear Regression is often the starting point for anyone journeying into machine learning and statistics. At its core, it solves a fundamental problem: given a set of data points representing pairs of inputs (X) and outputs (Y), how can we find the straight line that best describes the relationship between them?

Our goal is to build a model that can predict a continuous target variable Y based on an independent variable X. While libraries like Scikit-Learn do this in one line, building it from scratch using NumPy ensures a solid grasp of the underlying mathematics.

📐 The Mathematical Model

We assume that the relationship between our variables can be expressed by a linear equation:

y = \beta_0 + \beta_1 x + \epsilon

Where:

  • y is the actual target value.
  • x is the input feature.
  • \beta_0 is the intercept (where the line crosses the y-axis).
  • \beta_1 is the slope (how much y changes for every unit change in x).
  • \epsilon represents the irreducible error or noise in real-world data.

Our model attempts to learn estimates of these parameters, denoted as \hat{\beta}_0 and \hat{\beta}_1, to make predictions:

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

🎯 The Objective: Defining "Error"

How do we define the "best" line? We need a way to quantify how "wrong" our line is. We do this by calculating the residual (e_i) for every data point—the vertical distance between the actual point and our predicted line:

e_i = y_i - \hat{y}_i

To measure the total error of the model, we cannot simply sum the errors, as positives and negatives would cancel out. Instead, we sum the squares of the errors. This is known as the Sum of Squared Residuals (SSR) or the Cost Function S:

S(\beta_0, \beta_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2

Our entire goal is to find the specific values of \beta_0 and \beta_1 that result in the smallest possible value for S.
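To make this concrete, here is a minimal sketch of the cost S for two candidate lines on toy data (the numbers are invented purely for illustration):

```python
import numpy as np

# Toy data that roughly follows y = 2 + 3x with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1])

def ssr(beta_0, beta_1, x, y):
    """Sum of squared residuals S(beta_0, beta_1)."""
    residuals = y - (beta_0 + beta_1 * x)
    return np.sum(residuals ** 2)

# A line close to the true relationship scores far lower than a poor guess
print(ssr(2.0, 3.0, x, y))  # small
print(ssr(0.0, 1.0, x, y))  # much larger
```

Minimizing S means searching over all (\beta_0, \beta_1) pairs for the one with the lowest score; the two methods below do exactly that, one analytically and one iteratively.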

⚙️ Method 1: Ordinary Least Squares (OLS) — The Closed-Form Solution

Ordinary Least Squares is a method for finding the exact mathematical solution that minimizes the error function S. Because our cost function is smooth and convex (a paraboloid in \beta_0 and \beta_1), we can use calculus to find the minimum point by taking the partial derivatives and setting them to zero.

📉 Deriving the OLS Formulas

We need to minimize S with respect to both parameters.

1. Derivative with respect to the intercept (\beta_0):

Using the chain rule:

\frac{\partial S}{\partial \beta_0} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-1) = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)

Set to 0 and simplify:

\sum y_i - n\beta_0 - \beta_1 \sum x_i = 0

Dividing by n gives us the means \bar{y} and \bar{x}, leading to our first key formula:

\beta_0 = \bar{y} - \beta_1 \bar{x}

2. Derivative with respect to the slope (\beta_1):

\frac{\partial S}{\partial \beta_1} = \sum_{i=1}^{n} 2(y_i - \beta_0 - \beta_1 x_i)(-x_i) = -2 \sum_{i=1}^{n} x_i(y_i - \beta_0 - \beta_1 x_i)

Set to 0 and substitute the \beta_0 equation we just found. After significant algebraic rearrangement, we arrive at the final formula for the slope:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

This formula is essentially the covariance of X and Y divided by the variance of X.
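That covariance-over-variance reading is easy to verify numerically. The sketch below (data invented for illustration) computes the slope both ways; note that the ddof choice cancels in the ratio as long as it matches:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 6.0, 8.5, 10.0])

# Slope as covariance(x, y) / variance(x)
beta_1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta_0 = np.mean(y) - beta_1 * np.mean(x)

# Identical to the explicit centered-sums formula derived above
beta_1_sums = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(beta_1, beta_1_sums)  # the two slopes agree
```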

💻 OLS NumPy Implementation

Because OLS provides a closed-form solution, there is no training loop. We just plug in the data.

import numpy as np

class OLSLinearRegression:
    def fit(self, X, y):
        # Calculate means
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Calculate terms for the slope (beta_1) formula
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)

        # Calculate slope and intercept
        self.beta_1 = numerator / denominator
        self.beta_0 = y_mean - (self.beta_1 * x_mean)

    def predict(self, X):
        return self.beta_0 + self.beta_1 * X

🔄 Method 2: Gradient Descent — The Iterative Approach

While OLS is exact, it can be computationally expensive for massive datasets with many features (due to matrix inversion requirements in multivariate cases). Gradient Descent is an iterative optimization algorithm used to find the minimum of a function, widely used throughout deep learning.

Instead of solving it in one go, we start with random parameters and take small steps downhill in the direction of the steepest descent.

We often use the Mean Squared Error (MSE) as our cost function J here, usually divided by 2 to make the derivative cleaner:

J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2

📉 Calculating Gradients

We need the partial derivatives of the cost function J to know which way is "downhill":

\frac{\partial J}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)

\frac{\partial J}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i
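A standard way to gain confidence in derivatives like these is a finite-difference check: perturb each parameter slightly and compare the numerical slope of J against the analytic gradient. A minimal sketch, using toy data and an arbitrary evaluation point:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
n = len(x)

def cost(b0, b1):
    # J(b0, b1) = (1/2n) * sum((y_hat - y)^2)
    return np.sum((b0 + b1 * x - y) ** 2) / (2 * n)

def grads(b0, b1):
    # The analytic gradients derived above
    err = (b0 + b1 * x) - y
    return np.sum(err) / n, np.sum(err * x) / n

# Central finite differences should recover the same values
b0, b1, h = 0.5, 0.5, 1e-6
num_d0 = (cost(b0 + h, b1) - cost(b0 - h, b1)) / (2 * h)
num_d1 = (cost(b0, b1 + h) - cost(b0, b1 - h)) / (2 * h)
an_d0, an_d1 = grads(b0, b1)
print(num_d0 - an_d0, num_d1 - an_d1)  # both differences are tiny
```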

📦 Parameter Update Rule

We update parameters simultaneously using a learning rate (\alpha), which controls how big a step we take:

\beta_0 := \beta_0 - \alpha \frac{\partial J}{\partial \beta_0}

\beta_1 := \beta_1 - \alpha \frac{\partial J}{\partial \beta_1}

💻 Gradient Descent NumPy Implementation

class GDLinearRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.beta_0 = 0
        self.beta_1 = 0

    def fit(self, X, y):
        n_samples = len(X)

        for _ in range(self.iterations):
            # 1. Forward pass (make predictions)
            y_predicted = self.beta_0 + self.beta_1 * X

            # 2. Calculate Gradients
            # (Using vectorization for efficiency instead of loops)
            d_beta1 = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            d_beta0 = (1 / n_samples) * np.sum(y_predicted - y)

            # 3. Update Parameters
            self.beta_1 -= self.lr * d_beta1
            self.beta_0 -= self.lr * d_beta0

    def predict(self, X):
        return self.beta_0 + self.beta_1 * X

📚 Literature and Context

The roots of linear regression run deep into statistical history. The method of least squares was first published by Adrien-Marie Legendre in 1805, though Carl Friedrich Gauss claimed to have used it earlier to predict the location of the dwarf planet Ceres.

The term "regression" itself comes from Francis Galton in the 19th century. In studying heredity, he observed that sons of tall fathers tended to be taller than average, but shorter than their fathers—a phenomenon he called "regression towards mediocrity" (now regression to the mean).

Statistically, under the assumption that the error terms ϵ\epsilon are normally distributed, the OLS estimator is also the Maximum Likelihood Estimator (MLE). Furthermore, the Gauss-Markov theorem states that OLS estimators are the Best Linear Unbiased Estimators (BLUE), meaning they have the lowest variance among all unbiased linear estimators.

🧠 Final Thoughts

We have explored two distinct ways to arrive at the same result:

  • OLS: Utilizing calculus to derive a precise, closed-form mathematical solution. It is fast for smaller datasets.
  • Gradient Descent: Utilizing iterative optimization to creep down the error landscape. This is the foundational concept for training complex neural networks.

By implementing both from scratch, we move beyond viewing machine learning models as "black boxes" and gain an appreciation for the elegant mathematics that drives predictions.

📌 Next Steps

  • Expand the implementation to Multiple Linear Regression (using matrix notation Y = X\beta + \epsilon) to handle multiple input features.
  • Implement Polynomial Regression to capture non-linear relationships.
  • Add regularization terms (L1 Lasso and L2 Ridge) to prevent overfitting.
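As a preview of the first item, here is a sketch of multiple linear regression using NumPy's built-in least-squares solver (which avoids explicitly inverting X^T X); the feature count, coefficients, and seed below are invented for illustration:

```python
import numpy as np

# Synthetic data: y = 0.5 + 1.0*x1 - 2.0*x2 + noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # two features
y = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

# A column of ones absorbs the intercept into the coefficient vector
X_design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # roughly [0.5, 1.0, -2.0]
```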