The Least Squares Linear Regression Model

# Econometrics

## Intro

### Introduction

Model builders are often interested in understanding the *conditional variation* of one variable relative to others rather than their *joint probability*.

\vfill

Question: What feature of the conditional probability distribution are we interested in?

\vfill

Usually, the expected value $E[y|x]$, but sometimes it might be:

- the conditional median or other quantiles of the distribution (20th percentile, 5th percentile, etc.)
- the conditional variance

\vfill

Linear regression deals with the **conditional mean**.

## The Linear Regression Model

### The Linear Regression Model

$\mathbf{y}=f(\textbf{x}_1, \textbf{x}_2, \cdots, \textbf{x}_k)+\varepsilon$, where $\varepsilon$ is called the **disturbance** term.

\vfill

Our **theory** will specify the population regression equation $f(\textbf{x}_1, \textbf{x}_2, \cdots, \textbf{x}_k)$, which encompasses its functional form and the variables that matter.

\vfill

## Assumptions of the Linear Regression Model

### Assumptions of the Linear Regression Model

The linear regression model consists of a set of assumptions about how a data set will be produced by an underlying "data generating process."

\vfill

**Assumption A1**: The model specifies a linear relationship between $\mathbf{y}$ and $\textbf{x}_1,\cdots, \textbf{x}_k$:
$$\mathbf{y}=\textbf{x}_1\mathbf{\beta}_1+\textbf{x}_2\mathbf{\beta}_2+\cdots+\textbf{x}_k\mathbf{\beta}_k+\mathbf{\varepsilon}$$

\vfill

Notice that the assumption is about linearity in the parameters rather than in the $\mathbf{x}$'s.

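
As a worked illustration of linearity in the parameters (the specific functional forms, and the symbols $A$ and $x$, are illustrative assumptions and not taken from these notes), the first two models below satisfy A1 once the regressors are suitably defined, while the third does not:

$$\begin{aligned}
y &= \beta_1+\beta_2\ln x+\beta_3x^2+\varepsilon && \text{(linear in the } \beta\text{'s, nonlinear in } x\text{)}\\
y &= Ax^{\beta}e^{\varepsilon}\;\Rightarrow\;\ln y=\ln A+\beta\ln x+\varepsilon && \text{(linear in the parameters after taking logs)}\\
y &= \beta_1+x^{\beta_2}+\varepsilon && \text{(not linear in } \beta_2\text{)}
\end{aligned}$$
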

### Linearity of the Regression Model

Each observation of a given data set looks like
$$y_1=\beta_1 x_{11}+\beta_2x_{21}+\cdots+\beta_kx_{k1}+\varepsilon_1$$
$$y_2=\beta_1 x_{12}+\beta_2x_{22}+\cdots+\beta_kx_{k2}+\varepsilon_2$$
$$\vdots$$
$$y_n=\beta_1 x_{1n}+\beta_2x_{2n}+\cdots+\beta_kx_{kn}+\varepsilon_n$$

### Linearity of the Regression Model

In matrix form:

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/matrix2.png)

$$\textbf{y}=\textbf{X}\mathbf{\beta}+\mathbf{\varepsilon}$$

### Full Rank

**Assumption A2**: The columns of $\mathbf{X}$ are linearly independent and there are at least $k$ observations.

\vfill

Assumption A2 states that there are no exact linear relationships among the variables.

\vfill

Here's an example of a model that cannot be estimated, although we might be interested in quantifying each of the coefficients: the determinants of Monet's prices:
$$\ln \text{Price} = \beta_1 \ln \text{Size} + \beta_2 \ln \text{Aspect Ratio} + \beta_3 \ln \text{Height}+\varepsilon$$
where $\text{Size}=\text{Width}\times\text{Height}$ and $\text{Aspect Ratio}=\text{Width}/\text{Height}$. Because $\ln \text{Size}-\ln \text{Aspect Ratio}=2\ln \text{Height}$, the three regressors are exactly linearly dependent and A2 fails.

### Regression

**Assumption A3**: The disturbance is assumed to have conditional expected value zero at every observation: $E(\mathbf{\varepsilon}|\mathbf{X})=\mathbf{0}$

\vfill

No value of $\mathbf{X}$ conveys any information about $\varepsilon$. We assume that the $\varepsilon_i$'s are purely random draws from a population.

\vfill

Moreover, we assume $E[\varepsilon_i|\varepsilon_1, \cdots, \varepsilon_{i-1}, \varepsilon_{i+1}, \cdots, \varepsilon_n ]=0$.

\vfill

Notice that by the **Law of Iterated Expectations**:
$$E[\varepsilon_i]=E_X[E[\varepsilon_i|\mathbf{X}]]=E_X[0]=0$$

\vfill

### Regression

Point to note: $E[\varepsilon|\mathbf{X}]=0 \Rightarrow Cov(\mathbf{X},\varepsilon)=0$. The converse is not true: $Cov(\mathbf{X},\varepsilon)=0$ **does not** imply $E[\varepsilon|\mathbf{X}]=0$. Likewise, $E[\varepsilon]=0$ **does not** imply that $E[\varepsilon|\mathbf{X}]=0$.

\vfill

Accordingly, $E[\mathbf{y}|\mathbf{X}]=\mathbf{X}\mathbf{\beta}$.

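
Both facts on this slide follow in one line each. A short derivation, written for a single regressor $x_k$ and a single observation $i$, and using only A1, A3, and the Law of Iterated Expectations:

$$\begin{aligned}
Cov(x_k,\varepsilon_i) &= E[x_k\varepsilon_i]-E[x_k]\,E[\varepsilon_i]
= E_X\big[x_k\,E[\varepsilon_i|\mathbf{X}]\big]-E[x_k]\cdot 0=0,\\
E[\mathbf{y}|\mathbf{X}] &= E[\mathbf{X}\mathbf{\beta}+\mathbf{\varepsilon}\,|\,\mathbf{X}]
=\mathbf{X}\mathbf{\beta}+E[\mathbf{\varepsilon}|\mathbf{X}]=\mathbf{X}\mathbf{\beta}.
\end{aligned}$$
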

\vfill

Assumptions **A1** and **A3** comprise the *linear regression model*.

\vfill

What if $E[\mathbf{\varepsilon}]\neq0$?

###

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/example_2.2.png)

### Regression

Assumption **A3** is called the **exogeneity** assumption and it yields $E[\mathbf{y}|\mathbf{X}]=\mathbf{X}\mathbf{\beta}$.

\vfill

Whenever $E(\mathbf{\varepsilon}|x)\neq 0$, we say that $x$ is **endogenous** to the model. One way that this can happen is when we leave out a variable that matters for the relationship.

\vfill

Suppose the data generating process (DGP) of a given relationship is
$$Income=\gamma_1+\gamma_2 educ+\gamma_3 age +u$$
but we estimate the model
$$Income=\gamma_1+\gamma_2 educ +\varepsilon$$

\vfill

How do we show that **A3** is not satisfied?

### Homoskedasticity and Nonautocorrelated Disturbances

**Assumption A4**: $E[{\varepsilon\varepsilon'|\mathbf{X}}]=\sigma^2\mathbf{I}$

\vfill

Also, notice that $Var[\varepsilon]=E[Var(\varepsilon|\mathbf{X})]+Var[E(\varepsilon|\mathbf{X})]=\sigma^2\mathbf{I}$

### Data Generating Process for the Regressors

**Assumption A5**: $\mathbf{X}$ may be fixed or random.

\vfill

Fixed $\mathbf{X}$: Experimental designs, whereby the researcher fixes the values of $\mathbf{X}$ to find $\mathbf{y}$.

\vfill

Random $\mathbf{X}$: Observational studies. However, some columns of $\mathbf{X}$ can be fixed, such as indicator variables for a given time period or time trends.

### Normality

**Assumption A6**: $\varepsilon|\mathbf{X} \sim N(\mathbf{0},\sigma^2\mathbf{I})$

\vfill

This assumption is useful for hypothesis testing and constructing confidence intervals, but it is often unnecessary in sufficiently large samples, where the Central Limit Theorem applies.

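
To make the assumptions concrete, here is a minimal simulation sketch of a data generating process consistent with A1 to A6, followed by a rank check on the Monet regressors from the full-rank example. Everything numeric is an assumption made for illustration: the sample size, the parameter values, the seed, and the uniform ranges for width and height are hypothetical and not taken from any data set.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
n, k = 200, 3                    # hypothetical sample size and number of regressors

# A5: X is random here (observational setting); the first column is a constant.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# A2: the columns of X should be linearly independent (rank k, with n >= k).
rank_X = np.linalg.matrix_rank(X)

# A3, A4, A6: disturbances drawn independently of X, mean zero, common variance.
sigma = 1.5
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# A1: y is linear in the parameters.
beta = np.array([1.0, 2.0, -0.5])   # illustrative "true" parameters
y = X @ beta + eps

# Contrast: the Monet regressors violate A2, because
# ln(Size) - ln(Aspect Ratio) = 2 ln(Height).
width = rng.uniform(20, 100, size=n)
height = rng.uniform(20, 100, size=n)
X_monet = np.column_stack([
    np.log(width * height),   # ln Size
    np.log(width / height),   # ln Aspect Ratio
    np.log(height),           # ln Height
])
rank_monet = np.linalg.matrix_rank(X_monet)

print("rank of X:", rank_X, "| rank of Monet design:", rank_monet)
```

The Monet design has three columns but a rank below three, which is exactly the situation A2 rules out.
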

### Visual Summary of the Assumptions

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/figure_2.3.png)

## Computation of the Least Squares Regression

### Computational Aspects of the Least Squares Regression

Let's now consider the algebraic problem of choosing a vector $\mathbf{b}$ so that the fitted line $\mathbf{x}_i'\mathbf{b}$ is *close* to the data.

\vfill

We need to specify what we mean by *close* to the data (the fitting criterion).

\vfill

Usually, the fitting criterion is the *Least Squares* method: minimizing the sum of squared residuals, that is, the squared deviations of the observed $y_i$ from the fitted values.

\vfill

Crucial feature: LS regression provides a device for "holding other things constant".

\vfill

### The LS Population and Sample Models

Recall the population regression model: $E[y_i|\mathbf{x}_i]=\mathbf{x}_i'\mathbf{\beta}$

\vfill

We aim to find an estimate $\hat{y}_i=\mathbf{x}_i'\mathbf{b}$

\vfill

Define the *residuals* from the estimated regression as
$$e_i=y_i-\mathbf{x}'_i\mathbf{b}$$

\vfill

Notice that $y_i=\mathbf{x}_i'\mathbf{\beta}+\varepsilon_i=\mathbf{x}_i'\mathbf{b}+e_i$

\vfill

### The LS Coefficient Vector

The Least Squares criterion requires us to minimize
$$\sum_{i=1}^n{e_i^2}=\sum_{i=1}^n{(y_i-\mathbf{x}_i'\mathbf{b})^2}$$

\vfill

In matrix terms, we minimize
$$S(\mathbf{b})=\mathbf{e}'\mathbf{e}=(\mathbf{y}-\mathbf{X}\mathbf{b})'(\mathbf{y}-\mathbf{X}\mathbf{b})$$

\vfill

Expanding, we have
$$S(\mathbf{b})=\mathbf{y}'\mathbf{y}-2\mathbf{y}'\mathbf{X}\mathbf{b}+\mathbf{b}'\mathbf{X}'\mathbf{X}\mathbf{b}$$

\vfill

### The LS Coefficient Vector

The necessary condition for a minimum is
$$\frac{\partial S(\mathbf{b})}{\partial \mathbf{b}}=-2\mathbf{X}'\mathbf{y}+2\mathbf{X}'\mathbf{X}\mathbf{b}=\mathbf{0}$$
$$\mathbf{X}'\mathbf{X}\mathbf{b}=\mathbf{X}'\mathbf{y}$$

\vfill

From **A2**, we know that $\mathbf{X}$ has full column rank, which guarantees that $\mathbf{X}'\mathbf{X}$ is nonsingular, so its inverse exists.

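
The step from A2 to the invertibility of $\mathbf{X}'\mathbf{X}$ can be made explicit. For any nonzero $k\times 1$ vector $\mathbf{c}$ (a symbol introduced here only for this argument):

$$\mathbf{c}'\mathbf{X}'\mathbf{X}\mathbf{c}=(\mathbf{X}\mathbf{c})'(\mathbf{X}\mathbf{c})=\sum_{i=1}^n(\mathbf{x}_i'\mathbf{c})^2>0,$$

because linear independence of the columns of $\mathbf{X}$ rules out $\mathbf{X}\mathbf{c}=\mathbf{0}$ for $\mathbf{c}\neq\mathbf{0}$. Hence $\mathbf{X}'\mathbf{X}$ is positive definite, and therefore nonsingular.
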
Then, pre-multiplying both sides by $(\mathbf{X}'\mathbf{X})^{-1}$:
$$\mathbf{b}_0=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

\vfill

For the solution $\mathbf{b}_0$ to minimize the sum of the squared residuals, the matrix $\frac{\partial^2 S(\mathbf{b})}{\partial \mathbf{b}\,\partial \mathbf{b}'}=2\mathbf{X}'\mathbf{X}$ must be positive definite, which holds when $\mathbf{X}$ has full column rank.

### Algebraic Aspects of the LS Solution

We have
$$\mathbf{X}'\mathbf{X}\mathbf{b}-\mathbf{X}'\mathbf{y}=-\mathbf{X}'(\mathbf{y}-\mathbf{X}\mathbf{b})=-\mathbf{X}'\mathbf{e}=\mathbf{0}$$

\vfill

Hence, for every column $\mathbf{x}_k$ of $\mathbf{X}$, $\mathbf{x}_k'\mathbf{e}=0$.

\vfill

If the first column of $\mathbf{X}$ is a column of ones, $\mathbf{x}_1\equiv \mathbf{i}$, two implications follow:

1. The LS residuals sum to zero.
2. The regression hyperplane passes through the point of means of the data.

\vfill

### Projection

Recall the LS residuals:
$$\mathbf{e}=\mathbf{y}-\mathbf{Xb}$$

\vfill

Inserting $\mathbf{b}_0$, we have
$$\mathbf{e}=\mathbf{y}-\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}=(\mathbf{I}-\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')\mathbf{y}=\mathbf{M}\mathbf{y}$$

\vfill

The matrix $\mathbf{M}$ is called the "*residual maker*":

\vfill

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/definition_3.1.png)

### The Residual Maker

Properties of the matrix $\mathbf{M}$:

1. $\mathbf{M}$ is symmetric
2. $\mathbf{M}$ is idempotent
3. $\mathbf{MX}=\mathbf{0}$ (why?)

### The Projection Matrix

Now let
$$\hat{\mathbf{y}}=\mathbf{y}-\mathbf{e}=\mathbf{I}\mathbf{y}-\mathbf{My}=(\mathbf{I}-\mathbf{M})\mathbf{y}$$

\vfill

Thus,
$$\hat{\mathbf{y}}=\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}=\mathbf{Py}$$

\vfill

$\mathbf{P}$ is called a *projection* matrix: if a vector $\mathbf{y}$ is pre-multiplied by $\mathbf{P}$, the result is the vector of fitted values in the LS regression of $\mathbf{y}$ on $\mathbf{X}$.

\vfill

### The Projection Matrix

Properties of $\mathbf{P}$:

1. $\mathbf{P}$ is symmetric
2. $\mathbf{P}$ is idempotent
3. $\mathbf{P}\mathbf{X}=\mathbf{X}$

\vfill

Moreover, notice that $\mathbf{P}$ and $\mathbf{M}$ are orthogonal: $\mathbf{P}\mathbf{M}=\mathbf{M}\mathbf{P}=\mathbf{0}$

\vfill

Therefore, the LS regression partitions the vector $\mathbf{y}$ into two **orthogonal** parts:
$$\mathbf{y}=\mathbf{P}\mathbf{y}+\mathbf{M}\mathbf{y}= \text{Projection}+\text{Residuals}$$

###

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/figure_3.2.png)

## Partitioning and Partial Regressions

In some situations, we are only interested in a subset of the full set of variables in $\mathbf{X}$. The remaining variables are added to the model as "controls".

- Recall the returns to education example.

\vfill

Suppose we have
$$\mathbf{y}=\mathbf{X}\mathbf{\beta}+\mathbf{\varepsilon}=\mathbf{X}_1\mathbf{\beta}_1+\mathbf{X}_2\mathbf{\beta}_2+\mathbf{\varepsilon}$$
How can we find the algebraic solution for $\mathbf{b}_2$? That is, what is the LS estimator of a given subset of parameters, $\mathbf{\beta}_2$, in $\mathbf{\beta}$?

### Partial Regressions

Set up the **normal** equations:
$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}
\begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}$$

\vfill

Solving the first block of equations for $\mathbf{b}_1$ yields
$$\mathbf{b}_1=(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\mathbf{y}-\mathbf{X}_2\mathbf{b}_2)$$

### Partial Regressions

Suppose that $\mathbf{X}_1'\mathbf{X}_2=\mathbf{0}$. (what does this mean?)

\vfill

For this special case, the theorem below states that $\mathbf{b}_1$ can be obtained by regressing $\mathbf{y}$ on $\mathbf{X}_1$ only. Likewise, $\mathbf{b}_2$ can be obtained by regressing $\mathbf{y}$ on $\mathbf{X}_2$ only.

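
The projection results and the orthogonal special case above can be checked numerically. The following is a minimal sketch under stated assumptions: the data are simulated, the helper names `ls_coef` and `projection` are hypothetical, and $\mathbf{X}_2$ is made orthogonal to $\mathbf{X}_1$ by construction (it is the residual from projecting an arbitrary matrix `W` on $\mathbf{X}_1$).

```python
import numpy as np

def ls_coef(X, y):
    """Solve the normal equations X'X b = X'y (assumes full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def projection(X):
    """Projection matrix P = X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(1)   # seed chosen only for reproducibility
n = 50
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
W = rng.normal(size=(n, 2))      # arbitrary raw regressors
y = rng.normal(size=n)

P1 = projection(X1)
M1 = np.eye(n) - P1              # residual maker for X1

# Properties of M and P listed on the slides.
m_symmetric = np.allclose(M1, M1.T)
m_idempotent = np.allclose(M1 @ M1, M1)
m_annihilates_x = np.allclose(M1 @ X1, 0.0)
p_reproduces_x = np.allclose(P1 @ X1, X1)
pm_orthogonal = np.allclose(P1 @ M1, 0.0)

# Orthogonal partition: X2 = M1 W satisfies X1'X2 = 0 by construction.
X2 = M1 @ W
b_joint = ls_coef(np.column_stack([X1, X2]), y)
b1_alone = ls_coef(X1, y)        # y on X1 only
b2_alone = ls_coef(X2, y)        # y on X2 only

print(np.allclose(b_joint[:2], b1_alone), np.allclose(b_joint[2:], b2_alone))
```

When $\mathbf{X}_1'\mathbf{X}_2=\mathbf{0}$, the matrix $\mathbf{X}'\mathbf{X}$ is block diagonal, which is why the separate regressions reproduce the joint coefficients.
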

\vfill

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/theorem_3.1.png)

### The FWL Theorem

For the general case, in which $\mathbf{X}_1$ and $\mathbf{X}_2$ might not be orthogonal, the following theorem provides the more general solution:

\vfill

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/theorem_3.2.png)

### The FWL Theorem

We can represent $\mathbf{b}_2$ as
$$\mathbf{b}_2=(\mathbf{X}_2^{*'}\mathbf{X}_2^{*})^{-1}\mathbf{X}_2^{*'}\mathbf{y}^{*}$$

\vfill

where $\mathbf{X}_2^{*}=\mathbf{M}_1\mathbf{X}_2$ and $\mathbf{y}^*=\mathbf{M}_1\mathbf{y}$.

\vfill

Two questions:

1. What is $\mathbf{M}_1\mathbf{X}_2$?
2. What is $\mathbf{M}_1\mathbf{y}$?

\vfill

## The FWL Theorem

A special case of the FWL theorem is when we are interested in the computation of a single coefficient.

\vfill

Consider the regression of $\mathbf{y}$ on a set of variables $\mathbf{X}$ and an additional variable $\mathbf{z}$. Denote the coefficients $\mathbf{b}$ and $c$, respectively.

\vfill

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/corollary_3.2.1.png)

### The FWL Theorem

Example: Suppose we are interested again in the returns to education equation
$$Income=\beta_1+\beta_2 educ + \beta_3 age + \beta_4 age^2+\varepsilon$$

\vfill

To find $b_2$, the coefficient on $educ$:

1. Regress $Income$ on $age$ and $age^2$ and obtain residuals $r_1$
2. Regress $educ$ on $age$ and $age^2$ and obtain residuals $r_2$
3. Regress $r_1$ on $r_2$ and find the slope coefficient $b_2$.

### Regression with a constant term

Consider now the partition in which $\mathbf{X}_1=\mathbf{i}$ and $\mathbf{X}_2$ is the set of variables in the regression.

\vfill

Take a given column $\mathbf{x}$ of $\mathbf{X}_2$. According to the FWL theorem,
$$\mathbf{x}_*=\mathbf{M}_1\mathbf{x}$$

\vfill

When $\mathbf{X}_1=\mathbf{i}$, we denote $\mathbf{M}_1$ as $\mathbf{M}^0$.

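
Before working out the constant-only case, the three-step procedure in the returns-to-education example can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are simulated, every parameter value is made up, age is measured in decades purely to keep the design well conditioned, the helper names `ls_coef` and `residuals` are hypothetical, and a constant is included among the controls because the model has an intercept $\beta_1$.

```python
import numpy as np

def ls_coef(X, y):
    """LS coefficients of y on X, via NumPy's least-squares solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def residuals(X, y):
    """Residuals e = y - Xb from the LS regression of y on X."""
    return y - X @ ls_coef(X, y)

rng = np.random.default_rng(2)   # seed chosen only for reproducibility
n = 500
age = rng.uniform(2.0, 6.0, size=n)                   # age in decades (assumption)
educ = 8 + 1.0 * age + rng.normal(scale=2.0, size=n)  # correlated with age on purpose
income = (5 + 1.2 * educ + 8.0 * age - 0.8 * age**2
          + rng.normal(scale=3.0, size=n))            # made-up parameter values

const = np.ones(n)
controls = np.column_stack([const, age, age**2])      # X1: constant, age, age^2

# Route 1: full regression; the coefficient on educ is the second element.
X_full = np.column_stack([const, educ, age, age**2])
b_full = ls_coef(X_full, income)

# Route 2: FWL. Partial the controls out of both income and educ,
# then regress residuals on residuals.
r1 = residuals(controls, income)
r2 = residuals(controls, educ)
b_educ_fwl = (r2 @ r1) / (r2 @ r2)

# Constant-only residual maker: M0 x equals x in deviations from its mean.
M0 = np.eye(n) - np.outer(const, const) / n
demeaned_ok = np.allclose(M0 @ educ, educ - educ.mean())

print(b_full[1], b_educ_fwl, demeaned_ok)
```

Both routes give the same coefficient on $educ$ up to floating-point error. The last check uses the constant-only residual maker, $\mathbf{M}^0=\mathbf{I}-\mathbf{i}(\mathbf{i}'\mathbf{i})^{-1}\mathbf{i}'=\mathbf{I}-\frac{1}{n}\mathbf{i}\mathbf{i}'$.
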

\vfill

This yields
$$\mathbf{x}_*=\mathbf{x}-\bar{x}\,\mathbf{i}$$

\vfill

### Regression with a constant term

The result above says that the residuals in the regression of the columns of $\mathbf{X}_2$ on a constant term are deviations from the sample mean.

\vfill

Therefore, each column of $\mathbf{M}^0\mathbf{X}_2$ is the original variable, now in the form of deviations from the mean. This general result is summarized in the following corollary.

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/corollary_3.2.2.png)

### Orthogonal Regression

Finally, combining the Orthogonal Partitioned Regression and FWL theorems, the next theorem states that we can estimate each coefficient separately if the columns of $\mathbf{X}$ are orthogonal to each other.

\vfill

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/theorem_3.3.png)

## Partial Correlation and the Goodness of Fit

### The Partial Correlation Coefficients

The FWL theorem provides a framework to "partial out" the effect of a given variable in a regression.

\vfill

We can apply the same principles to find the degree of *correlation* between two variables after partialling out the effects of other factors.

\vfill

We proceed as follows:

1. $\mathbf{y}_*=$ residuals in a regression of $\mathbf{y}$ on the "controls"
2. $\mathbf{z}_*=$ residuals in a regression of $\mathbf{z}$ (the variable of interest) on the "controls"
3. Find the partial correlation $r_{y,z}^*$, the simple correlation between $\mathbf{y}_*$ and $\mathbf{z}_*$

### The Partial Correlation Coefficients

The square of the partial correlation coefficient is
$$r_{y,z}^{*2}=\frac{(\mathbf{z}_*' \mathbf{y}_*)^2}{(\mathbf{z}_*'\mathbf{z}_*)(\mathbf{y}_*'\mathbf{y}_*)}$$

### Sum of Squared Residuals

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/theorem_3.5.png)

### Goodness of Fit

We measure the goodness of fit of our estimates by asking whether *variation* in $\mathbf{X}$ is a good predictor of *variation* in $\mathbf{y}$.


\vfill

We measure the *variation* of a variable as deviations from its mean.

\vfill

For an individual observation, we have:
$$y_i=\hat{y}_i+e_i=\mathbf{x}_i'\mathbf{b}+e_i$$

\vfill

Subtracting $\bar{y}$ from both sides:
$$y_i-\bar{y}=\hat{y}_i-\bar{y}+e_i$$

\vfill

Recall that $\bar{y}=\bar{\mathbf{x}}'\mathbf{b}$ when the regression includes a constant term. Thus,
$$y_i-\bar{y}=(\mathbf{x}_i-\bar{\mathbf{x}})'\mathbf{b}+e_i$$

### Decomposition of $y_i$

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/figure_3.4.png)

### Total Sum of Squares

Notice that both $\sum_{i=1}^n{(y_i-\bar{y})}$ and $\sum_{i=1}^n{(x_i-\bar{x})}$ equal zero. Therefore, to quantify the fit, we use sums of squares instead.

\vfill

The total variation in $\mathbf{y}$ is, thus, the sum of the squared deviations
$$SST=\sum_{i=1}^n{(y_i-\bar{y})^2}$$

### Goodness of Fit

For the full set of observations, we have
$$\mathbf{M^0y}=\mathbf{M^0Xb}+\mathbf{M^0e},$$
where $\mathbf{M^0}$ is the $n\times n$ idempotent matrix that transforms observations into deviations from sample means.

\vfill

That is, $\mathbf{M^0}$ is the residual maker for $\mathbf{X}=\mathbf{i}$.

\vfill

Since $\mathbf{M^0e}=\mathbf{e}$ and $\mathbf{X}'\mathbf{e}=\mathbf{0}$ when the regression includes a constant term, the total sum of squares is
$$\mathbf{y'M^0y}=\mathbf{b'X'M^0Xb}+\mathbf{e'e}$$

\vfill

$$SST=SSR+SSE$$
where $SSR$ is the regression sum of squares and $SSE$ is the sum of squared residuals.

\vfill

Notice that this is the same partition we found before: $\mathbf{y}=\text{Projection}+\text{Residuals}$

\vfill

### Goodness of Fit

Our measure of goodness of fit is the **coefficient of determination**:
$$\frac{SSR}{SST}=\frac{\mathbf{b'X'M^0Xb}}{\mathbf{y'M^0y}}=1-\frac{\mathbf{e'e}}{\mathbf{y'M^0y}}=1-\frac{\sum_{i=1}^n{e_i^2}}{\sum_{i=1}^n{(y_i-\bar{y})^2}}$$
We denote it $R^2$, and it lies between 0 and 1 (why?).

### The Adjusted R-squared

One important issue with the R-squared as a measure of fit is that it never declines when variables are added to the model, even if the additional variables do not help improve the model's fit.

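
A minimal numerical sketch of this property, using simulated data. The sample size, parameter values, and the helper name `fit_summary` are hypothetical; the added regressor `noise` is irrelevant by construction.

```python
import numpy as np

def fit_summary(X, y):
    """Return (SST, SSR, SSE, R^2) for the LS regression of y on X.

    Assumes X contains a constant column, so that SST = SSR + SSE holds.
    """
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ b
    e = y - y_hat
    sst = np.sum((y - y.mean()) ** 2)       # y' M0 y
    ssr = np.sum((y_hat - y.mean()) ** 2)   # b' X' M0 X b
    sse = e @ e                             # e'e
    return sst, ssr, sse, 1.0 - sse / sst

rng = np.random.default_rng(3)   # seed chosen only for reproducibility
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=n)       # unrelated to y by construction

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, noise])

sst, ssr, sse, r2_small = fit_summary(X_small, y)
_, _, _, r2_big = fit_summary(X_big, y)

print("R2 without noise regressor:", r2_small)
print("R2 with noise regressor:   ", r2_big)
```

Adding a column can only enlarge the space onto which $\mathbf{y}$ is projected, so $\mathbf{e}'\mathbf{e}$ cannot increase and $R^2$ cannot fall.
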

\vfill

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/theorem_3.6.png)

### The Adjusted R-squared

Based on this, we introduce an alternative measure, which incorporates a penalty for adding variables to the model:
$$\bar{R}^2=1-\frac{\mathbf{e}'\mathbf{e}/(n-k)}{\mathbf{y'M^0y}/(n-1)}$$
For computational purposes, we can also rewrite $\bar{R}^2$ in terms of $R^2$:
$$\bar{R}^2=1-\frac{n-1}{n-k}(1-R^2)$$

### Comments on $\bar{R}^2$

1. $\bar{R}^2$ may decline when a variable is added to the set of independent variables.
2. $\bar{R}^2$ rises or falls depending on whether the contribution of the added variable to the fit of the regression more than offsets the correction for the loss of an additional degree of freedom.

## Linearly Transformed Regressions

As a final algebraic analysis, we consider the case of transformed variables in the model.

- For instance, changing the units of measurement from kilometers to miles, or expressing a variable "per 1,000 inhabitants".

\vfill

Let's consider the Monet paintings example again. Suppose we have two competing models representing the determinants of Monet's prices:

\vfill

**Model 1**: $\ln Price = \beta_1(1)+\beta_2 \ln W + \beta_3 \ln H + \varepsilon$

\vfill

**Model 2**: $\ln Price = \gamma_1(1)+\gamma_2 \ln (WH) + \gamma_3 \ln (W/H) + u$

### An Example: Art Appreciation

Rewrite the models as

\vfill

**Model 1**: $\ln Price = \beta_1 x_1+\beta_2 x_2 + \beta_3 x_3 + \varepsilon$

\vfill

**Model 2**: $\ln Price = \gamma_1 z_1+\gamma_2 z_2 + \gamma_3 z_3 + u$

\vfill

We can see that $z_1=x_1$, $z_2=x_2+x_3$, and $z_3=x_2-x_3$.

\vfill

We can write these conditions as $\mathbf{Z}=\mathbf{X}\mathbf{P}$, where $\mathbf{P}$ is a nonsingular matrix that transforms the columns of $\mathbf{X}$.

- What does $\mathbf{P}$ look like in this case?

### Linearly Transformed Regressions

![](/Users/henriquefonseca/Desktop/temp/Rmarkdown-practice/ Notes/2/theorem_3.8.png)

\vfill

In our art appreciation example, what is the relationship between the LS coefficient vectors of Model 1 and Model 2?
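
One way to explore this question numerically is sketched below. It relies on a standard algebraic fact: if $\mathbf{Z}=\mathbf{X}\mathbf{P}$ with $\mathbf{P}$ nonsingular, then $(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}=\mathbf{P}^{-1}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. The data are simulated, the parameter values are made up, and the name `d` for the coefficient vector of the transformed regression is a label introduced here.

```python
import numpy as np

rng = np.random.default_rng(4)   # seed chosen only for reproducibility
n = 80
ln_w = rng.normal(loc=4.0, scale=0.3, size=n)   # hypothetical ln Width
ln_h = rng.normal(loc=4.0, scale=0.3, size=n)   # hypothetical ln Height
ln_price = 1.0 + 1.3 * ln_w + 0.6 * ln_h + rng.normal(scale=0.5, size=n)

# Model 1 regressors: x1 = 1, x2 = ln W, x3 = ln H.
X = np.column_stack([np.ones(n), ln_w, ln_h])

# Z = XP with z1 = x1, z2 = x2 + x3, z3 = x2 - x3.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 1.0, -1.0]])
Z = X @ P                        # columns: 1, ln(WH), ln(W/H)

b = np.linalg.lstsq(X, ln_price, rcond=None)[0]   # Model 1 coefficients
d = np.linalg.lstsq(Z, ln_price, rcond=None)[0]   # Model 2 coefficients

d_from_b = np.linalg.solve(P, b)                  # P^{-1} b
same_fit = np.allclose(X @ b, Z @ d)              # identical fitted values

print(d, d_from_b, same_fit)
```

Because the fitted values coincide, the residuals and $R^2$ of the two models coincide as well; only the coefficients are re-expressed.
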