In short, a p-value is the probability of observing a coefficient estimate at least as far from 0 as the one in the sample, assuming the population coefficient is actually 0. It is not the probability that the coefficient is 0. The exact population coefficients are unknown due to sampling uncertainty, and the confidence intervals give a plausible range for the population coefficient at a chosen confidence level (e.g., 95% CI as above). If the population coefficient is 0, the independent variable has no linear effect on the target and can be deemed insignificant. Therefore, if the p-value is small (e.g., below 0.05), an estimate like ours would be unlikely under a zero coefficient, and we can lean in favor of the alternative that the feature is significant.
The p-values for the coefficients are derived from a hypothesis test. The null hypothesis states the coefficient is zero (insignificant), and the alternative claims the coefficient is not zero (significant). Hypothesis tests work by “proof by disproof”: we assume the null is true and ask how surprising the data would be. The p-value is the probability of obtaining a t-score at least as extreme as the observed one, given that the null statement, coefficient = 0, is true. If the p-value is small, we can reject the null hypothesis in favor of the alternative.
The output of the test is a t-score, which is then translated to a p-value using a t-value table. The two-tailed t-test statistic can be calculated as such:
t = (x_bar - 0) / SE
x_bar stands for the coefficient estimated from the sample and SE stands for that coefficient’s standard error. Because the hypothesized value is 0, simply dividing the coefficient by its standard error gives the t-score. The p-value is then found in a t-value chart using the t-score and the degrees of freedom (which equals the number of observations minus the number of feature columns minus 1).
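The lookup in a t-value chart can also be done numerically. The following is a minimal sketch, assuming SciPy is available; the coefficient, standard error, and sample sizes are hypothetical values chosen only for illustration, not numbers from the OLS summary above.

```python
from scipy import stats

# Hypothetical values, chosen only for illustration
coef = 2.5        # coefficient estimated from the sample (x_bar)
se = 1.0          # standard error of that coefficient (SE)
n_obs = 30        # number of observations
n_features = 2    # number of feature columns, excluding the intercept

# Degrees of freedom: observations minus feature columns minus 1
df = n_obs - n_features - 1

# t-score under the null hypothesis that the population coefficient is 0
t_score = (coef - 0) / se

# Two-tailed p-value: probability of a t-score at least this extreme
# in either direction, assuming the null hypothesis is true
p_value = 2 * stats.t.sf(abs(t_score), df)

print(f"df={df}, t={t_score:.3f}, p={p_value:.4f}")
```

`stats.t.sf` is the survival function (1 minus the CDF) of the t-distribution; doubling it covers both tails.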
If the p-value is less than some assigned alpha (e.g., 0.05), we can reject the hypothesis that the population coefficient is 0 and lean toward the alternative that the feature is significant. The confidence interval for a coefficient in the above OLS summary is a 95% confidence range for the population coefficient. The confidence intervals are calculated as such:
confidence interval = coefficient ± t_critical * SE
Here t_critical is the t-value that leaves alpha/2 in each tail for the given degrees of freedom.
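As a rough sketch of the confidence interval calculation, assuming SciPy is available: the coefficient, standard error, and degrees of freedom below are hypothetical illustration values, not numbers from the OLS summary above.

```python
from scipy import stats

# Hypothetical values, chosen only for illustration
coef = 2.5    # estimated coefficient
se = 1.0      # standard error of the coefficient
df = 27       # degrees of freedom
alpha = 0.05  # 1 - confidence level, so this gives a 95% interval

# Critical t-value that leaves alpha/2 in each tail
t_crit = stats.t.ppf(1 - alpha / 2, df)

# The interval is symmetric around the estimated coefficient
lower = coef - t_crit * se
upper = coef + t_crit * se

print(f"t_crit={t_crit:.4f}, CI=({lower:.3f}, {upper:.3f})")
```

When the 95% interval excludes 0, as it does for these illustrative numbers, the two-tailed test at alpha = 0.05 rejects the null hypothesis; the two views always agree.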
It’s important to note that an increase in sample observations will tighten the confidence interval, as the standard error of the coefficient will decrease (roughly in proportion to 1 over the square root of the number of observations). In addition, when the population coefficient is not 0, the magnitude of the t-score tends to grow with the number of observations, roughly in proportion to its square root, therefore lowering the p-value. Intuitively, more observations decrease the sampling variation of the coefficient estimate. However, increasing observations will not shrink the standard error of the regression, which estimates the spread of the residuals; only the standard error of the coefficient shrinks.
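A small simulation can illustrate this difference. The sketch below assumes NumPy is available and uses made-up data (a single feature, a true slope of 0.5, and noise with standard deviation 2.0); the helper name `fit_summary` is hypothetical and exists only for this example.

```python
import numpy as np

def fit_summary(n_obs, rng):
    """Fit y = b0 + b1*x by OLS on simulated data.

    Returns (standard error of the regression, standard error of the slope).
    """
    x = rng.normal(size=n_obs)
    X = np.column_stack([np.ones(n_obs), x])        # intercept + one feature
    y = 1.0 + 0.5 * x + rng.normal(scale=2.0, size=n_obs)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    df = n_obs - 1 - 1                              # observations - features - 1
    s = np.sqrt(resid @ resid / df)                 # standard error of the regression
    se_slope = s * np.sqrt(XtX_inv[1, 1])           # standard error of the slope
    return s, se_slope

rng = np.random.default_rng(42)
results = {n: fit_summary(n, rng) for n in (100, 10_000)}

for n, (s, se_slope) in results.items():
    print(f"n={n:>6}: regression SE={s:.3f}, slope SE={se_slope:.4f}")
```

With 100 times more observations, the slope’s standard error should shrink by roughly a factor of 10, while the standard error of the regression stays close to the noise level of 2.0 used to generate the data.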
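The standard error of a coefficient is calculated from the sum of squared residuals, the degrees of freedom, and the inverse of X_transpose*X: it is the square root of (sum of squared residuals / degrees of freedom) multiplied by the matching diagonal element of that inverse.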
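The calculation can be sketched with NumPy as follows. This is a minimal illustration on simulated data, not the code behind the OLS summary above; the true coefficients, sample size, and variable names are assumptions made for the example.

```python
import numpy as np

# Simulated data, for illustration only
rng = np.random.default_rng(0)
n_obs, n_features = 200, 2
features = rng.normal(size=(n_obs, n_features))
X = np.column_stack([np.ones(n_obs), features])     # prepend an intercept column
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n_obs)

# OLS coefficients: (X_transpose * X)^-1 * X_transpose * y
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

# Residual variance: sum of squared residuals / degrees of freedom
residuals = y - X @ beta
ssr = residuals @ residuals
df = n_obs - n_features - 1
sigma2 = ssr / df

# Standard error of each coefficient: sqrt of residual variance times
# the matching diagonal element of (X_transpose * X)^-1
se = np.sqrt(sigma2 * np.diag(XtX_inv))

# t-scores follow by dividing each coefficient by its standard error
t_scores = beta / se

print("coefficients:   ", beta)
print("standard errors:", se)
print("t-scores:       ", t_scores)
```

Inverting X_transpose*X directly mirrors the formula and is fine for a small example; on poorly conditioned data, a least-squares solver such as `np.linalg.lstsq` is usually the safer way to get the coefficients.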
I hope this blog has helped you interpret the OLS summary. The main takeaways from this blog are:
- P-value is the by-product of the hypothesis test that the population coefficient is 0 (insignificant). It measures how unlikely the observed estimate would be if that were true.
- Coefficient Confidence Intervals represent a plausible range for the population coefficient. The larger the number of observations, the tighter the confidence interval of the coefficients (NOT regression/model) will be.
- Standard Error of Coefficients can be calculated using fancy linear algebra.