
How Statistical Regressions Learn to Trust: OLS, WLS, and GLS in Action

Last Updated on October 18, 2025 by Editorial Team

Author(s): Sohom Majumder

Originally published on Towards AI.

A beginner-friendly guide to OLS, WLS, and GLS regression in Python — complete with visuals, code, and clear examples for better model selection.

Linear regression, at its heart, is a conversation between data points and a line. Each observation makes a claim — “I’m part of the trend!” — and Ordinary Least Squares (OLS) assumes every voice in that conversation is equally clear and trustworthy. Under OLS, all data points are treated as if they were measured under the same conditions, with the same precision, and with no mutual influence.

Real data rarely oblige. Some measurements are noisy; others are systematically more reliable. Time series data bring yet another complication: observations start to “remember” each other, so that today’s error echoes yesterday’s. In such cases, the elegant simplicity of OLS becomes a limitation. It still finds a line, but that line can be skewed by loud, unreliable, or correlated points.

To address these imperfections, two extensions of OLS — Weighted Least Squares (WLS) and Generalized Least Squares (GLS) — step in. WLS acknowledges that some data points deserve a louder or softer voice by weighting each observation according to its reliability. GLS goes even further, recognizing that data points may not be independent at all, and explicitly modeling the structure of correlation among errors.

Let’s start with a small five-row dataset. Here, we’re modeling exam score vs. study hours for five students. As hours increase, performance variance also grows. We created the data using the following code:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'Hours': [1, 2, 3, 4, 5],
    'Score': [52, 56, 63, 70, 75]
})

Let’s plot the data now:

Figure: Study hours (X) vs. exam scores (y)
X = sm.add_constant(data['Hours'])
y = data['Score']

y = data['Score'] → tells the model what we’re predicting.

sm.add_constant(data['Hours']) → builds the design matrix X and tells the model to include an intercept so the line can move vertically to fit the data better.

Without it, you’re forcing the line to pass through (0,0).
With it, you allow the model to find the right starting height.
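
If you print X, you can see the extra column of ones that add_constant appends; that constant column is what the intercept multiplies:

print(X)
#    const  Hours
# 0    1.0      1
# 1    1.0      2
# 2    1.0      3
# 3    1.0      4
# 4    1.0      5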

OLS, or Ordinary Least Squares, tries to draw the straight line that fits your data points “best.” But what does best mean mathematically?

It means OLS looks at every data point, calculates how far each point’s actual value is from the predicted value on the line, squares that distance (so negatives don’t cancel positives), and then adds them all up. The “best” line is the one that makes this total, the sum of squared residuals, as small as possible.

ols = sm.OLS(y, X).fit()
data['OLS_pred'] = ols.predict(X)


plt.scatter(data['Hours'], data['Score'], label='Data')
plt.plot(data['Hours'], data['OLS_pred'], label='OLS fit', linewidth=2)
plt.legend(); plt.show()
Figure: OLS fit

Ordinary Least Squares, or OLS, is the most basic and most widely used type of regression. You can think of it as drawing the single “best possible” straight line through your data points — the line that passes closest to all the dots on average. It tries to balance how far each point is from the line so that no point is treated as more important than another. Every observation gets an equal say in where that line ends up.

When you run ols = sm.OLS(y, X).fit(), the computer calculates this best-fitting line for you. The command data['OLS_pred'] = ols.predict(X) then uses that line to predict what values of the dependent variable (in our example, exam scores) the model thinks you should see for each value of the independent variable (hours studied).

When you plot these results with plt.scatter for the data and plt.plot for the fitted line, you get a clear picture: the scattered points show the actual data, and the line shows what the model believes is the general trend. The line passes roughly through the middle of the cloud of points because that’s where the model thinks the “average relationship” lies.
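
We can verify that definition of “best” numerically with a quick sketch (resid and ssr are just illustrative names; ols.ssr is the same quantity stored on the fitted result):

resid = y - ols.predict(X)    # vertical gap between each point and the fitted line
ssr = (resid ** 2).sum()      # square the gaps so signs don't cancel, then add them up
print(f"Sum of squared residuals: {ssr:.2f}")
print(f"statsmodels stores the same number: {ols.ssr:.2f}")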

OLS works under two key assumptions. First, it assumes that the variation in your data is roughly the same everywhere — the spread of points around the line doesn’t systematically get wider or narrower as you move along the x-axis. This property is called homoscedasticity. Second, it assumes that the errors (the little vertical differences between each point and the line) are unrelated to each other — one error doesn’t predict the next. This is the independence assumption.
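
If you want to probe these two assumptions on your own data, statsmodels ships standard diagnostics; here is a minimal sketch (bearing in mind that five points are far too few for these tests to be reliable):

from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Breusch-Pagan: a small p-value hints that the error variance is not constant
bp_stat, bp_pvalue, _, _ = het_breuschpagan(ols.resid, X)
print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")

# Durbin-Watson: values near 2 suggest little correlation between successive errors
print(f"Durbin-Watson statistic: {durbin_watson(ols.resid):.2f}")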

Weighted Least Squares, or WLS, is like OLS with a bit of common sense added. In the real world, not all data points are equally trustworthy. Some come from situations with more noise or uncertainty — for example, students who studied for many hours might have very different results depending on stress, sleep, or luck. WLS lets us tell the computer, “trust the steadier data more, and give less importance to the unpredictable ones.”

weights = 1 / (data['Hours']**2)
wls = sm.WLS(y, X, weights=weights).fit()
data['WLS_pred'] = wls.predict(X)

When we run the line weights = 1 / (data['Hours']**2), we’re simply saying that the higher the number of study hours, the smaller the weight we’ll give that data point, because those students’ scores are more scattered. The command wls = sm.WLS(y, X, weights=weights).fit() asks Python to find a new line that pays more attention to the reliable points and less to the noisy ones. Then data['WLS_pred'] = wls.predict(X) gives us the new predicted scores based on this “weighted” understanding.
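
Under the hood, WLS is equivalent to rescaling each observation by the square root of its weight and then running plain OLS on the rescaled data. Here is a minimal sketch of that equivalence (w_sqrt and manual_wls are illustrative names):

w_sqrt = np.sqrt(weights)
X_scaled = X.mul(w_sqrt, axis=0)   # scale both columns of the design matrix, row by row
y_scaled = y * w_sqrt
manual_wls = sm.OLS(y_scaled, X_scaled).fit()
print(manual_wls.params)           # the coefficients match wls.params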

plt.scatter(data['Hours'], data['Score'], label='Data')
plt.plot(data['Hours'], data['OLS_pred'], label='OLS', linestyle='--')
plt.plot(data['Hours'], data['WLS_pred'], label='WLS', linewidth=2)
plt.legend(); plt.show()
Figure: WLS vs. OLS comparison

When you plot both lines — the dashed OLS line and the solid WLS line — they look similar but not identical. The WLS line tends to be a little flatter here because the model is no longer being pulled upward by those less reliable, high-hour data points. It’s as if the WLS method listened politely to everyone but leaned in a bit closer to the people who spoke more clearly.

In simple terms, WLS corrects for uneven reliability in your data. Instead of pretending every observation is equally accurate, it gives stronger influence to consistent data and downplays the messy ones. The result is a line that tells a more balanced and realistic story when your data vary in quality or stability.

Generalized Least Squares, or GLS, goes one step further than WLS. It not only handles differences in reliability between data points but also deals with the situation where the errors themselves are connected. This kind of problem is especially common in time-based data — for example, when today’s result depends partly on what happened yesterday.

Imagine you are recording test scores for students over several days of study. If a student performs better on one day, it’s likely they’ll perform somewhat better the next day too. That means the “errors” — the small gaps between what the model predicts and what actually happens — aren’t truly independent. Ordinary regression methods, like OLS or even WLS, assume those errors are random and unconnected. When that assumption breaks, the model starts to get overconfident. It might still draw a line through the data, but the uncertainty it reports (those little statistical “I’m sure” statements) becomes misleadingly small.

GLS fixes that by taking into account the way these errors move together. It looks at the pattern of connections among the data points — for example, how much one observation resembles its neighbor — and adjusts the fit accordingly. In our short example, the code block creates a simple pattern of mild correlation (each error being somewhat related to the previous one). When the GLS model is fitted, it subtly reshapes the line to reflect that dependency.

from scipy.linalg import toeplitz

rho = 0.5                                      # correlation between neighboring errors
sigma = rho ** toeplitz(np.arange(len(data)))  # entry (i, j) becomes rho**|i - j|
gls = sm.GLS(y, X, sigma=sigma).fit()
data['GLS_pred'] = gls.predict(X)


plt.scatter(data['Hours'], data['Score'], color='black', label='Data')
plt.plot(data['Hours'], data['OLS_pred'], '--', label='OLS')
plt.plot(data['Hours'], data['WLS_pred'], '-.', label='WLS')
plt.plot(data['Hours'], data['GLS_pred'], '-', label='GLS')
plt.legend(); plt.show()

The above block of code creates a pattern of correlation between the observations so that the GLS model knows how the data points are related to one another. The line rho = 0.5 defines how strongly the errors are connected — in this case, each error is assumed to be moderately related (50%) to the previous one, like a gentle ripple that fades over time. The toeplitz(np.arange(len(data))) call from SciPy builds a matrix where each diagonal contains the same number; here, entry (i, j) is the distance |i - j|. When you raise rho to the power of each entry, element by element, you get a table that describes how every pair of observations is correlated: points close together in the data sequence are more strongly connected, while points far apart are less related. This matrix, called sigma, represents the pattern of correlation among the errors.
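
You can also print sigma to see this pattern directly (values rounded for readability):

print(np.round(sigma, 3))
# [[1.    0.5   0.25  0.125 0.062]
#  [0.5   1.    0.5   0.25  0.125]
#  [0.25  0.5   1.    0.5   0.25 ]
#  [0.125 0.25  0.5   1.    0.5  ]
#  [0.062 0.125 0.25  0.5   1.   ]]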

A visual representation of the matrix sigma is shown below:

Figure: The sigma correlation matrix

This matrix shows how strongly each pair of observations is related to one another in the GLS model. The diagonal values are all 1, meaning each observation is perfectly correlated with itself. As you move away from the diagonal, the numbers get smaller — 0.5 for immediate neighbors, 0.25 for points two steps apart, 0.125 for three, and so on — showing that the connection weakens with distance. In plain terms, this tells the model that nearby data points tend to move somewhat together (their errors are linked), while points further apart act more independently. It’s like saying each observation “remembers” a little of what came before it, but that memory fades as time or distance increases.

The next line, gls = sm.GLS(y, X, sigma=sigma).fit(), tells the GLS model to use that correlation pattern when fitting the regression line. Unlike OLS, which assumes that each observation’s error is independent, GLS “whitens” the data — it adjusts for the fact that neighboring points share some relationship. Finally, data['GLS_pred'] = gls.predict(X) generates the predicted values based on this corrected model. In simple terms, these commands teach the regression that the data have memory — what happens at one point in time slightly influences what happens next — and it should fit the trend line accordingly, without pretending that each point stands completely alone.
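
That whitening step can be sketched by hand: factor sigma with a Cholesky decomposition, transform X and y by the inverse factor so the transformed errors become uncorrelated, and run plain OLS on the result. This is an illustrative sketch, not the exact statsmodels internals (L_chol and manual_gls are made-up names):

L_chol = np.linalg.cholesky(sigma)        # sigma = L_chol @ L_chol.T
L_inv = np.linalg.inv(L_chol)
X_white = L_inv @ np.asarray(X)           # decorrelate the design matrix
y_white = L_inv @ np.asarray(y)           # decorrelate the responses
manual_gls = sm.OLS(y_white, X_white).fit()
print(manual_gls.params)                  # matches gls.params from sm.GLS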

Figure: OLS vs. WLS vs. GLS fits

When you plot all three lines together, you’ll notice the GLS line sits very close to the others, but with tiny shifts in slope and intercept. Those small changes matter: GLS isn’t trying to draw a completely different line, it’s trying to produce one that tells the truth about the reliability of your data. By acknowledging that observations may “echo” one another, it gives more honest estimates and avoids false precision.

In plain language, if OLS listens to everyone equally and WLS listens more to the steady voices, GLS goes a step further and says, “Some of these voices are actually harmonizing or repeating each other — let’s account for that before deciding what the true message sounds like.”

Now a simple model comparison shows:

comparison = pd.DataFrame({
    'Model': ['OLS', 'WLS', 'GLS'],
    'Intercept': [ols.params['const'], wls.params['const'], gls.params['const']],
    'Slope': [ols.params['Hours'], wls.params['Hours'], gls.params['Hours']],
    'R-squared': [ols.rsquared, wls.rsquared, gls.rsquared]
})
print(comparison)
Figure: Model comparison output

The table above compares how the three regression methods — OLS, WLS, and GLS — interpret the same small dataset. The intercept represents the starting point of the line (the predicted score when study hours are zero), and the slope shows how much the predicted score rises with each extra hour of study. The R-squared value indicates how well the model’s line fits the actual data, with values close to 1 meaning a very tight fit. All three models tell a similar story — that more hours of study lead to higher scores — but they differ slightly in how steeply the line climbs and how confidently they summarize that pattern.

The OLS model gives the steepest slope (6.0), which means it believes scores increase fastest with each extra hour of study. But OLS treats every observation as equally reliable, even the noisier ones. The WLS model, which down-weights those uncertain data points, produces a gentler slope (5.64) and a slightly lower R-squared. It’s as if WLS listened carefully to each student but paid less attention to the ones who might have been distracted or inconsistent, leading to a calmer, more balanced conclusion. The line still captures the upward trend, but it doesn’t overreact to extremes.

The GLS model sits somewhere in between (slope 5.85). It also acknowledges that some students’ performances are linked across observations — perhaps because their results are influenced by similar factors like fatigue or preparation style. This model slightly reduces the apparent strength of the relationship and lowers R-squared again, reflecting its awareness that the data points “talk” to one another rather than being completely independent. In everyday terms, if OLS is the teacher who averages all grades without question, and WLS is the teacher who trusts the most consistent students more, then GLS is the teacher who recognizes that some students copy each other’s answers and adjusts the grading accordingly.

Thank you! 🙂
