Econometrics

Regression models and their estimation are essentially what econometrics is about. By this we mean equations of the form $$Y_t = \beta_0 + \beta_1 X_t + u_t$$ where $X$ may itself be a function (linear or non-linear) of another variable. We commence with a bivariate regression model in order to keep in mind that in practice there may be more than one model parameter $\beta$. The regression models themselves are almost entirely linear, but that by no means makes them simple: we deal extensively with vectors and matrices, which requires a thorough understanding of linear algebra.

Bivariate Distributions. Conditional Distributions and Conditional Moments

The very term regression comes from nineteenth-century observations that the heights of parents $X$ and their children $Y$ seemed to regress towards the mean of the observations – at the time a worrying finding, seemingly suggesting that people were getting shorter. We now know, however, that what was observed was the joint distribution of the heights of parents and children.

\begin{align}
\left(\begin{array}{c} Y \\ X \end{array}\right) \sim \mathcal{N} \left( \left(\begin{array}{c} \mu_y \\ \mu_x \end{array}\right), \; \left(\begin{array}{cc} \sigma^2_y & \sigma_{xy} \\ \sigma_{xy} & \sigma^2_x \end{array}\right) \right) \qquad \cdots \text{parents \& children}
\end{align}

If the variables are jointly normally distributed, then by definition the conditional distribution (height $Y$ of child given the height $X$ of parents) and the marginal distribution (height $X$ of parents) are also normally distributed. The normal distribution can be characterised by its first two moments alone, as the third and fourth moments are functions of the first two. From probability theory we have the factorisation of a joint distribution into a conditional and a marginal distribution

$$D_j(Y, X; \theta_j) = D_c(Y \mid X; \theta_c) \times D_m (X; \theta_m)$$
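As a quick numerical check of this factorisation for the bivariate normal case, we can compare the joint density with the product of the conditional and marginal densities. The parameter values below are purely illustrative (they are not taken from any data in the text), and the conditional-moment formulae used are the standard bivariate-normal ones:

```python
# Numerical check that D_j = D_c x D_m for a bivariate normal.
# All parameter values are illustrative only.
import numpy as np
from scipy.stats import norm, multivariate_normal

mu_y, mu_x = 170.0, 168.0             # means (think heights in cm)
s2_y, s2_x, s_xy = 40.0, 36.0, 18.0   # variances and covariance

joint = multivariate_normal(mean=[mu_y, mu_x],
                            cov=[[s2_y, s_xy], [s_xy, s2_x]])

y, x = 175.0, 172.0

# Marginal of X: N(mu_x, s2_x)
f_m = norm.pdf(x, loc=mu_x, scale=np.sqrt(s2_x))

# Conditional of Y given X = x:
# N(mu_y + (s_xy/s2_x)(x - mu_x),  s2_y - s_xy^2/s2_x)
cond_mean = mu_y + (s_xy / s2_x) * (x - mu_x)
cond_var = s2_y - s_xy**2 / s2_x
f_c = norm.pdf(y, loc=cond_mean, scale=np.sqrt(cond_var))

# The joint density equals conditional times marginal.
print(np.isclose(joint.pdf([y, x]), f_c * f_m))  # True
```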

Most of the time in economics we are actually interested in the conditional distribution. What is the output of a firm given a certain input? What are wages like given a person’s age and education? For joint normally distributed variables, we consider conditional moments – conditional expectation and the conditional variance.

\begin{align}
\mathbb{E}(Y \mid X = x) &= \mu_y + \frac{\sigma_{xy}}{\sigma_x^2} (x - \mu_x)\\
\mathbb{E}(X \mid Y = y) &= \mu_x + \frac{\sigma_{xy}}{\sigma_y^2} (y - \mu_y)
\end{align}
These are the regression functions, linear functions of $x$ and $y$ respectively. Observe that $\frac{\sigma_{xy}}{\sigma_x^2}$ is simply the regression coefficient $\beta$ of the regression of $Y$ on $X$, and likewise $\frac{\sigma_{xy}}{\sigma_y^2}$ is the coefficient of the regression of $X$ on $Y$. What does it mean to regress one variable on another? In this instance, we obtain the best predictor of the dependent variable $Y$ (not an estimate of $\beta$) without drawing on the exogeneity assumption or any statement about the relationship between the errors and the regressors.
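A small simulation can confirm that the slope of the regression of $Y$ on $X$ recovers $\sigma_{xy}/\sigma_x^2$. The means, variances and covariance below are chosen purely for illustration:

```python
# Simulation check: the OLS slope of Y on X approaches sigma_xy / sigma_x^2.
# Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
mu = [170.0, 168.0]
cov = [[40.0, 18.0], [18.0, 36.0]]   # [[s2_y, s_xy], [s_xy, s2_x]]
y, x = rng.multivariate_normal(mu, cov, size=200_000).T

slope_theory = 18.0 / 36.0           # sigma_xy / sigma_x^2 = 0.5
slope_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(slope_theory, slope_ols)       # the two values are very close
```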

The statements made so far are not causal statements, as we have not yet addressed the second conditional moment, the conditional variance. By definition the second conditional moment is

\begin{align}
\mathbb{E}\Bigl(Y - \mathbb{E}( Y \mid X ) \Bigr)^2 = \mathbb{E}\Bigl[\Bigl(Y - \mathbb{E}( Y \mid X ) \Bigr)\Bigl(Y - \mathbb{E}( Y \mid X ) \Bigr)'\Bigr]
\end{align}
In the linear regression model $Y= X\beta + u$, we assumed $\mathbb{E}(u) = 0$. Hence $\mathbb{E}(Y | X) = X\beta$. The second moment (variance) then becomes

\begin{align}
\mathbb{V}( Y \mid X ) = \mathbb{E}\bigl[(Y - X\beta )(Y - X\beta)'\bigr] = \mathbb{E}(u u') = \sigma^2 I
\end{align}
This gives us $$\mathbb{V}( Y \mid X ) = \sigma^2 I$$ that is, $\mathbb{V}(y_t \mid x_t) = \sigma^2$ for each observation. Thus it turns out that the distributional assumptions are simply an alternative way of stating the exogeneity assumption $\mathbb{E} (X'u) = 0$ given the classical assumptions of zero-mean errors $\mathbb{E} (u) = 0$, homoskedasticity $\mathbb{E}(u_t^2) = \sigma^2$, zero serial correlation between the errors $\mathbb{E} (u_t u_{t-i}) = 0$ for $i \neq 0$, and full column rank $K$ for the matrix $X$.
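These assumptions can be illustrated in a short simulation: when the errors are drawn independently of the regressors with mean zero and constant variance, the sample analogue of $\mathbb{E}(X'u)$ is close to zero and the error variance is recovered. The design and parameter values below are illustrative only:

```python
# Sketch of the classical assumptions: u drawn independently of X,
# zero mean, constant variance sigma^2. Illustrative values throughout.
import numpy as np

rng = np.random.default_rng(1)
T, K = 100_000, 3
X = rng.normal(size=(T, K))            # full column rank K (w.p. 1)
beta = np.array([1.0, -2.0, 0.5])
sigma = 2.0
u = rng.normal(scale=sigma, size=T)    # independent of X
y = X @ beta + u

print(np.max(np.abs(X.T @ u / T)))     # sample analogue of E(X'u) = 0: near 0
print(u.var())                         # near sigma^2 = 4
```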

In fact, some statisticians will argue that you cannot make assumptions about the relationship between an observed regressor $X$ and an unobservable error $u$. These statisticians shy away from the explicit definition of exogeneity based on the independence of the errors and regressors. They define the assumption via the conditional distributions of the observed variables, or in other words a partition of the joint distribution.

Comparing the distribution of $Y$ for the univariate and multivariate scenarios we can simply state

\begin{align}
\text{univariate} \quad y_t &= \beta' x_t + u_t \qquad y_t \sim \mathcal{IN}(\beta' x_t, \sigma^2) \\
\text{multivariate} \quad Y &= X \beta + u \qquad \quad \;\, Y \sim \mathcal{N}(X \beta, \sigma^2 I)
\end{align}

Note that in the multivariate case, the off-diagonal elements of the variance-covariance matrix $\sigma^2 I$ are zero; this implies independence, which is therefore not explicitly stated.

Independence is a much stronger assumption than zero correlation. However, in the case of the normal distribution, zero correlation implies independence of jointly normally distributed variables.

We represent this in a formula as $$f(y_t; \theta) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \bigl( \frac{y_t - \beta x_t}{\sigma}\bigr)^2} $$ for a single observation. Again, remember that $\mathbb{E}(y_t) = \beta x_t$.
This is the probability density of the normal distribution, with $\theta = (\beta, \sigma^2)$ being the parameters of the distribution. For a sample of $T$ observations, each observation has the same probability density. Since the observations are independent, to obtain the probability density of the entire sample we must multiply the densities of the individual observations:
\begin{align}f(y_1; \theta) \, f(y_2; \theta) \cdots f(y_T; \theta) &= \prod_{i=1}^T \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \frac{(y_i - \beta x_i)^2}{\sigma^2}}\\
&= \frac{1}{(\sqrt{2 \pi \sigma^2})^T} \exp\Bigl(-\frac{1}{2} \frac{\sum_{i=1}^T(y_i - \beta x_i)^2}{\sigma^2}\Bigr)
\end{align}
$(y_i - \beta x_i)^2 = u_i^2$, the squared error, which in the multivariate case (the whole sample) gives
\begin{align}
f(y_1, \ldots, y_T; \theta) = \frac{1}{(\sqrt{2 \pi \sigma^2})^T} \exp\Bigl(-\frac{1}{2} \frac{u'u}{\sigma^2}\Bigr)
\end{align}

This is the probability of observing the sample, given that the observations are independent and normally distributed.
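We can verify numerically, for a small illustrative sample, that the product of the $T$ individual densities coincides with the compact joint form $(2\pi\sigma^2)^{-T/2}\exp(-u'u/2\sigma^2)$:

```python
# Check: product of individual normal densities equals the compact joint
# form. Sample size and parameters are illustrative only.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
T, beta, sigma = 8, 1.5, 0.7
x = rng.normal(size=T)
y = beta * x + rng.normal(scale=sigma, size=T)
u = y - beta * x                      # the errors u_i = y_i - beta * x_i

product_form = np.prod(norm.pdf(y, loc=beta * x, scale=sigma))
joint_form = (2 * np.pi * sigma**2) ** (-T / 2) * np.exp(-(u @ u) / (2 * sigma**2))

print(np.isclose(product_form, joint_form))  # True
```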

We can now reverse this reasoning and pose the question: given an observed sample, how likely is it that the sample was generated by the distribution characterised by $f(y_t; \theta)$? This leads to the method of maximum likelihood, and we define the likelihood function $L(\theta)$ of the sample: $$L(\theta) = f(y_1; \theta) \, f(y_2; \theta) \cdots f(y_T; \theta) $$ Dealing with sums is generally easier than dealing with products, so we define the log-likelihood function $$LL(\theta) = \sum_{i = 1}^T \ln f(y_i; \theta)$$
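As a sketch of the idea, we can maximise the log-likelihood numerically over $\beta$ (holding $\sigma^2$ fixed, and dropping constants that do not depend on $\beta$) and confirm that it reproduces the OLS slope. The data-generating values below are illustrative only:

```python
# Sketch: maximising LL(theta) = sum_i ln f(y_i; theta) over beta for the
# model y_i = beta * x_i + u_i reproduces the OLS slope.
# True parameter values are illustrative only.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
T, beta_true, sigma = 500, 2.0, 1.0
x = rng.normal(size=T)
y = beta_true * x + rng.normal(scale=sigma, size=T)

def neg_ll(beta):
    # Negative log-likelihood in beta, constants in sigma dropped.
    u = y - beta * x
    return 0.5 * (u @ u) / sigma**2

beta_ml = minimize_scalar(neg_ll).x
beta_ols = (x @ y) / (x @ x)          # OLS slope for a single regressor

print(beta_ml, beta_ols)              # the two estimates coincide
```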