Part 1: The geometry in ℝⁿ, math and intuition of OLS and standard errors under homoskedasticity

May 2026

Okay so how exactly should you visualise the set up that you perform OLS estimation in? Honestly, I wasn't too sure, and in the process of trying to figure out how to compute cluster robust standard errors for a multinomial logit, I decided to start from the basics, and understand the math and geometry behind ordinary standard errors for Ordinary Least Squares estimates.

Throughout, we use Y = (2, 3, 6), so n = 3 and β̂₀ = mean(Y) = 11/3 ≈ 3.67. Rotate every plot — that's the point.

Step 1 — You got data, congratulations, now what?

Okay let's say you got some data, the wage of your 3 closest neighbors, and maybe their highest completed grade. You want to run an OLS, but since you're annoying, you want to do it from scratch, mathsy mathsy, and not run any reg, lm, or whatever python people came up with function on a software. How do you get your point estimates? How do you compute your standard errors? What are you even doing when you run OLS? (And I will leave you figure out hypothesis testing for yourself, I am not entering that realm from a core math perspective just yet). Honestly, I don't know precisely enough, and that's why I am writing this.

Okay so your friends are not doing very very well for themselves. One of them earns 2 dollars per hour, the second 3, and the third 6 (he always pays the first and only round). That, essentially, is a point in ℝ³, and this is where, if I understand correctly, one should start. There are no multiple points here — the entire dependent variable is encoded in the coordinates of this single red dot.

Step 2 — The Column Space / Image of X: What are your options?

Okay now the idea is to say "I have this data, what can I do about it that would be informative?". A first and solid answer to that would be "run an OLS to learn about the relationship between some covariates and the dependent variable". Okay but what does the OLS do? Let's start with the basics: The column space / Image of X — Step 1: The constant.

If you read until here, please get checked, and find some friends. Additionally, if you've read so far, you probably know that if you regress Y on a constant, you get its mean. Why is that?

Look at things in the following way:

the constant is simply the vector (1,1,1) in ℝ³
The line going from the origin through (1,1,1) is essentially the set of all possible fitted values such that all three fitted values are equal. It is the Column Space / Image of X
Your point estimate is the scalar by which you multiply (1,1,1) such that the distance between Y and (Ȳ, Ȳ, Ȳ) is minimized

Step 2b — How to find the point estimate? (Minimizing the Sum of Squared Residuals)

Now you want to find the point on the column space of X that is closest to Y. The distance between two points (a, b, c) and (d, e, f) in ℝ³ is √((a−d)² + (b−e)² + (c−f)²). Since √ is monotone, minimizing the distance is the same as minimizing the sum of squares inside — so OLS is literally minimizing Σ êᵢ², the sum of squared gaps between the line and the point Y.

Those gaps are the individual residuals. The plot below makes that concrete: instead of drawing the residual ê as one dashed diagonal line from Ŷ to Y, you can walk from Ŷ to Y one axis at a time — one step per observation. Each colored segment is êᵢ · eᵢ: the miss for observation i, measured along that observation's axis.

Pink moves along the Obs-1 axis by ê₁ ≈ −1.67 (friend one earns below average), gold moves along Obs-2 by ê₂ ≈ −0.67 (friend two, also below), teal moves along Obs-3 by ê₃ ≈ +2.33 (friend three earns above — he's buying the round). The three steps add up to exactly the dashed line: ê = ê₁·e₁ + ê₂·e₂ + ê₃·e₃.

This is just the standard coordinate decomposition of a vector in ℝ³ — nothing magic yet. The magic comes next, when we ask where ê is forced to live.

Step 3 — The projection and the orthogonal complement

OLS picks the point on the line closest to Y. Call that point Ŷ. Geometrically, it's the orthogonal projection of Y onto the line. Algebraically, Ŷ = mean(Y) · (1, 1, 1) — the sample mean, tiled three times.

The residual ê = Y − Ŷ must, by construction, be perpendicular to the line — otherwise we could shorten the distance by moving Ŷ slightly. So ê lives somewhere in the 2D plane perpendicular to (1, 1, 1), passing through Ŷ. That plane is the orthogonal complement of col(X), shown in orange below.

Notice: the plane has dimension 2, which is exactly n − k = 3 − 1.

Step 4 — The full decomposition: Y = Ŷ + ê

Now we see the full decomposition. Y = Ŷ + ê, written as two arrows: a green one from the origin to Ŷ (the fit), and a red one from Ŷ to Y (the residual). The dashed gray arrow from the origin to Y is just there to show that vector addition closes the triangle.

The green and red arrows are perpendicular — you can verify this by rotating the plot until you view the (1, 1, 1) direction head-on. From that angle, the green arrow collapses to a point, and the red arrow juts out horizontally.

Step 5 — What changes when we add a covariate?

Now suppose we also observe education for each friend: X₂ = (2, 2, 4) — friends one and two both have 2 years of college, friend three has 4. Our design matrix goes from a single column (1,1,1)ᵀ to a 3×2 matrix with an intercept and an education column.

The column space of X grows from a 1D line to a 2D plane — specifically, span{(1,1,1), (2,2,4)}. The two blue lines inside the plane are those two spanning directions: (1,1,1) (the intercept, same as before) and (2,2,4) (the education vector). Any fitted vector is some linear combination a·(1,1,1) + b·(2,2,4) = (a+2b, a+2b, a+4b).

Notice that for this particular X, the first two components of every fitted vector are always equal — because friends one and two happen to have the same values for every regressor (intercept 1, education 2). That's why the plane visually coincides with {Obs 1 = Obs 2} here, but that's a feature of this dataset, not a definition of the column space.

Now β̂ = (β̂₀, β̂₁) has to be chosen from all possible pairs. Each choice of (β₀, β₁) places the fitted vector at the point β₀·(1,1,1) + β₁·(2,2,4) somewhere on the plane. OLS picks the pair that minimizes Σ êᵢ² — the squared distance between that point and Y. Just as in the k=1 case, this is a search over a subspace (now 2D instead of 1D), and the minimizer is the orthogonal projection of Y onto it.

The result: β̂₀ = −1, β̂₁ = 1.75, giving Ŷ = (2.5, 2.5, 6). The dotted red segment from Ŷ to Y is ê — short but clearly sticking out perpendicular to the plane. Rotate the plot to see it pop out.

Step 5b — The residual decomposition with k=2

Same staircase as before — walk from Ŷ to Y one axis at a time. Pink steps along the Obs-1 axis by ê₁ = −0.5 (friend one earns below his fitted value), gold steps along Obs-2 by ê₂ = +0.5 (friend two earns above). The teal diamond at the end marks the Obs-3 step: ê₃ = 0. Friend three is perfectly fit, so there is nothing to walk.

The residual ê is now constrained to a 1D line. You can work out which one: ê must satisfy ê·(1,1,1) = 0 (residuals sum to zero) and ê·(2,2,4) = 0 (residuals are orthogonal to the education column). Those two equations force ê₃ = 0 and ê₂ = −ê₁, so the entire residual lives on the line (t, −t, 0). Adding one covariate used up one dimension of the residual space — that's all n − k in σ̂² = Σêᵢ² / (n − k) is tracking.

The algebra — deriving β̂ in matrix form

The geometry says: project Y onto col(X). The algebra says: find β that minimizes the squared length of the residual. Both give the same answer — here's how the algebra gets there.

Write the objective

min_β S(β) = (Y − Xβ)ᵀ(Y − Xβ)

= YᵀY − 2βᵀXᵀY + βᵀXᵀXβ

First-order condition — set ∂S/∂β = 0

−2XᵀY + 2XᵀXβ = 0

Rearrange — the normal equations

XᵀX β = XᵀY

Solve (assuming XᵀX is invertible)

β̂ = (XᵀX)⁻¹ XᵀY

Now plug in our numbers: Y = (2, 3, 6)ᵀ and X =121214.

Compute XᵀX and XᵀY

XᵀX =38824XᵀY =1134

Invert XᵀX

det(XᵀX) = 3·24 − 8·8 = 72 − 64 = 8

(XᵀX)⁻¹ =1/8 ·24−8−83

β̂ = (XᵀX)⁻¹ XᵀY

β̂ =1/8 ·24−8−83·1134=1/8 ·264−272−88+102=1/8 ·−814=−11.75

Result and check

β̂₀ = −1, β̂₁ = 1.75

Ŷ = −1·(1,1,1) + 1.75·(2,2,4)

= (−1,−1,−1) + (3.5, 3.5, 7)

= (2.5, 2.5, 6) ✓

Step 6 — Estimating σ², the variance of the noise

Okay so we have β̂. Point estimates in hand. But a point estimate without a standard error is basically a guess dressed up in matrix notation. To get standard errors, we need σ² — the variance of the noise ε that we never observe directly.

What we do observe is ê = Y − Ŷ. And the key insight is that ê is not some opaque distortion — it's a linear function of ε. Define the annihilator matrix M = I − X(XᵀX)⁻¹Xᵀ. Then:

ê = Mε

To see why, substitute Y = Xβ + ε into ê = MY:

ê = M(Xβ + ε) = MXβ + Mε = 0 + Mε = Mε

The MXβ term vanishes because MX = 0 — M is constructed to kill anything in col(X). Geometrically: M projects onto col(X)⊥, and col(X) is orthogonal to col(X)⊥ by definition. So the residual is just the noise ε, after the systematic part Xβ has been projected away.

Compute the squared length of ê

Since ê = Mε, and M is symmetric (Mᵀ = M) and idempotent (M² = M):

êᵀê = (Mε)ᵀ(Mε) = εᵀMᵀMε = εᵀM²ε = εᵀMε

Take expectations — the trace trick

εᵀMε is a scalar, so it equals its own trace. Use the cyclic property tr(ABC) = tr(BCA):

E[εᵀMε | X] = E[tr(Mεεᵀ) | X] = tr(M · E[εεᵀ | X])

Under homoskedasticity, E[εεᵀ | X] = σ²I:

E[êᵀê | X] = tr(Mσ²I) = σ² · tr(M)

tr(M) = n − k — two ways to see it

Algebraically: tr(M) = tr(I) − tr(X(XᵀX)⁻¹Xᵀ) = n − tr((XᵀX)⁻¹XᵀX) = n − tr(Iₖ) = n − k.
Geometrically: M is a symmetric idempotent that projects onto col(X)⊥, which has dimension n − k. For symmetric idempotents, trace = rank = dimension of the projection target.

This is the exact same n − k we kept seeing in the plots — the number of free dimensions left in col(X)⊥ after k dimensions were claimed by col(X). It's not a degrees-of-freedom incantation; it literally drops out of the algebra as tr(M).

The unbiased estimator

Putting it all together: E[êᵀê | X] = σ²(n − k). Divide by n − k:

σ̂² = êᵀê / (n − k)

This is unbiased by construction: E[σ̂² | X] = σ².

Plugging in our numbers

ê = (−0.5, +0.5, 0), so êᵀê = (−0.5)² + (0.5)² + 0² = 0.25 + 0.25 = 0.5.

With n = 3 and k = 2:

σ̂² = 0.5 / (3 − 2) = 0.5

Here's what that looks like geometrically. The purple line below is col(X)⊥ — the 1D line span{(1,−1,0)} that ê is forced to live in. The gray plane is col(X). Rotate until you're looking straight down the col(X) direction: the plane vanishes, and you'll see ê sitting on the purple line while Ŷ collapses to the origin — that's M·Ŷ = 0. M annihilates the plane and keeps only the line.

Step 6b — What does OLS “see” of the true noise ε?

So far we've been looking at ê — the residual we observe. But underlying the data is a true noise vector ε that we can't see. Let's make that concrete with a specific data-generating process.

Suppose the true model is wage = 1.5 × education + ε — no base wage, every extra year of education earns $1.50. With education = (2, 2, 4), the true signal is Xβ = (3, 3, 6). Compare that to our observed Y = (2, 3, 6): friend 1 should earn $3 but earns $2, so ε₁ = −1. Friends 2 and 3 earn exactly what the model predicts: ε₂ = ε₃ = 0. So the true noise is ε = (−1, 0, 0).

That vector sits somewhere in ℝ³ — not in col(X), not in col(X)⊥. M decomposes it into two pieces:

P·ε = (−0.5, −0.5, 0) — the part of ε that lands inside col(X). OLS never sees this piece: it doesn't affect the residual ê, but it does shift the fitted value from Xβ = (3, 3, 6) to Ŷ = (2.5, 2.5, 6) within the plane. This is exactly why β̂ ≠ β_true even in a correctly specified model — the noise leaks into the coefficient estimates.
M·ε = (−0.5, +0.5, 0) = ê — the part in col(X)⊥. This is everything OLS can recover from the noise. It sits exactly on the purple line span{(1,−1,0)}, which is why σ̂² = êᵀê/(n−k) estimates σ² without needing to know ε directly.

The plot below shows this as a classic projection picture. The orange vector is ε = (−1, 0, 0), drawn from the origin. Drop a perpendicular from its tip onto the purple col(X)⊥ line — the foot of that perpendicular is ê = (−0.5, +0.5, 0), shown as the purple diamond. That dashed gray segment is the drop; the purple diamond is where it lands. The dotted purple line from Ŷ to Y is the same ê in the context of the original data — both views are showing the same vector. This higlights the equivalence between ê = Y - Ŷ = M·e

The rationale I was given by Claude was the following: You have an error vector in ℝ³, with elements drawn from a distribution with the same variance. If you look at the length of that vector squared you should get 3σ². However, since you are not projecting the Y (and the errors) onto an n−k space, then you are artificially shrinking the length of that projected vector. In fact, in a space of dimension n−k, the expected length squared of the error vector is (n−k)σ². As a result, when you look at the residuals, they can actually only move along an (n−k) dimensional space, and thus to adjust for that and get an estimate of the error variance, you need to readjust the denominator of your estimator to (n−k). This is visually why the error variance estimator divides the sum of squared residuals by (n−k) and not by n.

Step 7 — Standard errors for β̂

Okay, last piece. σ̂² is in hand. Now we want the variance of β̂ itself — how much would the estimates bounce around if we had a different draw of ε?

Start from β̂ = (XᵀX)⁻¹XᵀY and substitute Y = Xβ + ε:

β̂ = (XᵀX)⁻¹Xᵀ(Xβ + ε) = β + (XᵀX)⁻¹Xᵀε

So β̂ − β = (XᵀX)⁻¹Xᵀε — the estimation error is a linear function of the noise. That's actually useful, because we know exactly how to take the variance of a linear function.

Apply the variance formula for linear transformations

For any matrix A, Var(Aε) = A · Var(ε) · Aᵀ. With A = (XᵀX)⁻¹Xᵀ:

Var(β̂ | X) = (XᵀX)⁻¹Xᵀ · Var(ε | X) · X(XᵀX)⁻¹

Apply homoskedasticity — Var(ε|X) = σ²I

The middle collapses:

Var(β̂ | X) = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹

= σ²(XᵀX)⁻¹

Standard errors

Var(β̂ | X) is a k×k matrix — diagonal entries are the variances of individual coefficients, off-diagonals are covariances. Plug in σ̂² and take the square root of the j-th diagonal entry:

SE(β̂ⱼ) = √[ σ̂² · ((XᵀX)⁻¹)ⱼⱼ ]

The two halves of this formula have a clean geometric interpretation. (XᵀX)⁻¹ encodes the shape of the regressors inside col(X) — how spread out they are, how correlated. σ̂² is estimated from the residuals that live outside col(X), in col(X)⊥. You need both. The column space gives you the shape of the uncertainty; the orthogonal complement gives you the scale.

Plugging in our numbers

Var(β̂) = σ̂² · (XᵀX)⁻¹ = 0.5 ·1/8 ·24−8−83=1.5−0.5−0.50.1875

SE(β̂₀) = √1.5 ≈ 1.22

SE(β̂₁) = √0.1875 ≈ 0.43

The SEs are large — but n = 3, so we're estimating a wage regression on three friends. Run this on actual data and you'd get something more sensible.

Step 8 — Where does regressor correlation hide in the SE formula?

Looking at SE(β̂ⱼ) = √[ σ̂² · ((XᵀX)⁻¹)ⱼⱼ ], you might wonder: where exactly does the correlation between regressors enter? The formula has (XᵀX)⁻¹ in it, which supposedly encodes regressor geometry — but it's not obvious where collinearity is hiding.

The answer is in the determinant. Take the 2×2 case — intercept plus one covariate — where XᵀX has a and c on the diagonal and b off-diagonal:

XᵀX =abbc(b = x₁ᵀx₂, the exact inner product of the two regressor columns)

(XᵀX)⁻¹ =1/(ac − b²)·c−b−ba

The determinant ac − b² sits in the denominator of every entry — including both diagonals, which are what go into the SEs.

As the regressors become more correlated, b grows. The determinant ac − b² shrinks toward zero. Every entry of (XᵀX)⁻¹ blows up. SEs blow up with it.

The extreme case is perfect collinearity — b² = ac, determinant = 0, (XᵀX)⁻¹ doesn't exist. Geometrically: the two regressors point in exactly the same direction in ℝⁿ, the column space collapses to 1D even though you fed in two columns, and OLS can't split the fitted value into "credit for X₁" versus "credit for X₂." No unique β̂, full stop.

There's an equivalent formula that makes this completely explicit. The variance of a single coefficient can be written as:

Var(β̂ⱼ) = σ² / [ (n−1) · Var(Xⱼ) · (1 − R²ⱼ) ]

where R²ⱼ is the R² from regressing Xⱼ on all the other regressors. The factor 1 / (1 − R²ⱼ) is the variance inflation factor (VIF). If Xⱼ is uncorrelated with the other regressors, R²ⱼ = 0 and VIF = 1 — no penalty. If Xⱼ is 90% explained by the others, R²ⱼ = 0.9 and VIF = 10, meaning the SE is √10 ≈ 3.16× larger than it would be without the correlation.

In our running example: regressing education X₂ = (2, 2, 4) on the intercept gives R²₂ = 0 — once you demean, the education values have no remaining pattern that the intercept can explain. So VIF = 1, no inflation. Which is consistent with what we found in Step 7 — plug into the formula to check:

σ² / [ (n−1) · Var(X₂) · 1 ] = 0.5 / [ 2 · (4/3) ] = 0.5 / (8/3) = 0.1875 ✓

Geometrically: if the columns of X nearly align in ℝⁿ, the projection Ŷ onto col(X) is still uniquely defined — but splitting that projection into "how much came from X₁" and "how much from X₂" becomes ambiguous. Tiny movements in Y can shift a lot of credit between β̂₁ and β̂₂ even though Ŷ barely moves. High variance in β̂, low variance in Ŷ. The determinant of XᵀX is exactly what measures how much room there is to distinguish the regressors — and when it gets small, there's almost none.

Step 8b — What changes when k > 2?

Nothing fundamental. For a general k×k matrix XᵀX, the inverse is still (adjugate)/(det) — the determinant lives in the denominator of every entry, diagonals included. More regressors, bigger matrix, same structure.

The multicollinearity story extends cleanly. Near-collinearity among any subset of regressors — not just a pair — shrinks det(XᵀX) toward zero and inflates all SEs. If X₃ is nearly a linear combination of X₁ and X₂, those three columns of X nearly span only a 2D subspace in ℝⁿ rather than 3D, the determinant reflects that collapse, and everything gets inflated. The exact condition is det(XᵀX) = 0, which happens precisely when the columns of X are linearly dependent — i.e., when col(X) has dimension strictly less than k.

The VIF formula carries over identically. For each non-intercept regressor Xⱼ, regress Xⱼ on all the other regressors, get R²ⱼ, and VIF = 1/(1 − R²ⱼ). Rule of thumb: VIF > 10 (equivalently R²ⱼ > 0.9) is usually treated as a concern. Beyond that, the higher the VIF, the closer you are to exact collinearity.

That's the whole story. The point Y in ℝⁿ gets decomposed into a piece in col(X) and a piece in col(X)⊥. The first piece gives you β̂. The second piece gives you σ̂². Combining them gives you standard errors.

Everything else in regression — t-statistics, F-tests, confidence intervals, robust standard errors, clustered standard errors — is built on top of this decomposition. The trick going forward is usually about how you treat the residual: homoskedastic, heteroskedastic, clustered, and so on. But the geometry never changes.

← Back to Getting to clustered SEs for MNL