More Useful, Less Wrong

Introduction

Eric Fischer over at Experimental Geography recently posted an intriguing analysis of San Francisco rent data last week. In the post he attempts to answer the question of whether building more rental units in San Francisco will lower rents using an extremely impressive compilation of advertised rents, total wages, total employment, and total housing units in San Francisco between 1975 and 2014, and then builds a statistical model to predict rent prices from per capita wages, total employment and number of building units. From the model, Mr. Fischer concludes that rising wages and employment coupled with very few new housing units being built are responsible for rents rising faster than inflation, but that building more housing units (around 1.5% a year) would help stabilize rents in real terms.

I truly admire the dogged and obsessive work it took to compile this data set, as well as the guts to try to answer an important policy question using actual data. And Mr. Fischer's conclusion does make intuitive sense.

However, being a skeptic and a stickler for good statistical procedure, I was concerned that Mr. Fischer does not display any goodness of fit statistics beyond coefficient of determination (R^2) values. As statisticians have warned for decades, a high R^2 value by itself does not indicate a statistically significant regression model. Similarly, an increased R^2 value resulting from adding more explanatory variables does not mean the added variables significantly improved the model; it can also mean that the model is becoming over-fit. Statistical text books and references of all stripes agree that additional statistical tests are required to judge goodness of fit when adding variables to a regression model, although they sometimes disagree about the correct methods.

Luckily, Mr. Fischer has been gracious and open enough to post his raw data on GitHub, so that stats geeks like me can verify and improve the analysis. I applaud this, given Mr. Fischer's self professed lack of statistical expertise and the ongoing replication crisis in science. Given this opportunity, I therefore set out to do the following using the R statistical package and Mr. Fischer's data:

Independently reproduce Mr. Fischer's reported modeling results.
Conduct tests to determine which of Mr. Fischer's models (if any) are the "best fit" model.
Re-analyze the data to account for the time-series nature of the data.

My approach is information-theoretic, my statistical software is R, and my primary references reflect my ecology background:
Mixed Effects Models and Extensions in Ecology with R. 2009. Zuur, Ieno, Walker, Saveliev and Smith. Springer
Ecological Models and Data in R. 2008. Benjamin M. Bolker. Princeton.

1. Reproducing the Results

The data set Mr. Fischer compiled consists of the following columns: Year, median rent, housing units, net new housing, employment, total wages, and CPI. The CPI is in terms of 1984 dollars. While housing unit counts go back to 1906 and median rents go back to 1948, employment and total wages only go back to 1975.

From reading his post carefully, it appears the analysis proceeded like this:

Filter data to include only 1975-2014.
Adjust median rent and total wages for inflation to real 1984 dollars by dividing them by the CPI.
Calculate Annual Per Capita Wages by dividing Total Wages by Employment

The following R code accomplishes this:

# Get data set using file chooser dialogue box
SF <- read.csv(file.choose(), blank.lines.skip = TRUE )

# Get the time series with all complete data - 1975 to 2014
SF_gt1974 <- subset(SF, year > 1974 )
SF_75to14 <- subset(SF_gt1974, year < 2015 )

# Adjust median rent and total wages to 1984 dollars using CPI
SF_75to14$real84_median_rent <- SF_75to14$median_rent / SF_75to14$CPI
SF_75to14$real84_totalwages <- SF_75to14$total_wages / SF_75to14$CPI

# Calculate CPI-adjusted annual per capita wages
SF_75to14$real84_avg_ann_wages <- (SF_75to14$real84_totalwages / SF_75to14$employment)

In his post, Mr. Fischer modeled median rent two different ways. First, he modeled median rent as a function of per capita wages and employment, which we will call the "simple" model. Second, he modeled median rent a function of per capita wages, employment, AND total housing inventory, which we will call the "Housing Inventory" model.

In both cases, he used a log-log model, which is to say that he took the logarithm (presumably base 10) of the independent variable and all dependent variables in order to transform a non-linear multiplicative model into a linear model. While the justification for doing so can be questioned, we avoid that for now and focus on reproducing his results.

1A. Reproducing the "Simple" Model

The following code fits the data to the simple model:

rentmodel_simple <-lm(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages),
data=SF_75to14)
summary(rentmodel_simple)

Here are the results:

Call:
lm(formula = log10(real84_median_rent) ~ log10(employment) +
    log10(real84_avg_ann_wages), data = SF_75to14)

Residuals:
      Min          1Q          Median        3Q          Max
-0.064949 -0.023399 -0.005385 0.014964 0.116402

Coefficients:
                                                  Estimate     Std. Error t value       Pr(>|t|)
(Intercept)                                 -10.77231    0.99852    -10.788    5.57e-13 ***
log10(employment)                    1.33400   0.18848     7.078       2.23e-08 ***
log10(real84_avg_ann_wages)   1.35093     0.07236    18.669       < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03647 on 37 degrees of freedom
Multiple R-squared: 0.9418,     Adjusted R-squared: 0.9386
F-statistic: 299.1 on 2 and 37 DF, p-value: < 2.2e-16

Four things are apparent from these results. First, the F-statistic for the overall fit is highly significant. Secondly, the coefficient of determination (R^2) is ~0.94, which is similar to the 0.92 Mr. Fischer reported in the comments to his post. Third, the employment and Avg_wages coefficients (which are the same as the exponents in the multiplicative model) are very similar to the exponents reported on the graph of the simple model. Fourth, the coefficients on all terms are significantly different from zero. This is strong evidence that this model fit is very similar to Fischer's simple model. But what about the intercept coefficient?

Mr. Fischer reports a value of 2.03118e-11 as a "static" multiplier on the graph of his simple model. In order to compare this to our intercept coefficient, we need to transform it to the same "log10" scale as is used in fitting the R model. Log10(2.03118e-11) = -10.69225. While not the same, this is very close to our intercept coefficient. Not knowing the details of the fitting algorithm Mr. Fischer used, we conclude that we have reproduced substantially equivalent results for the simple model using the same data. Whether or not this is the best fit model and whether or not model assumptions have been violated will be addressed later.

1B. Reproducing the "Housing Inventory" Model

The following code fits the data to the "Housing Inventory" model:

rentmodel_inventory <-lm(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
+log10(housing_units), data=SF_75to14)
summary(rentmodel_inventory)

The results:

Call:
lm(formula = log10(real84_median_rent) ~ log10(employment) +
    log10(real84_avg_ann_wages) + log10(housing_units), data = SF_75to14)

Residuals:
      Min                1Q          Median         3Q           Max
-0.067660     -0.018309    -0.000776 0.012929 0.103568

Coefficients:
                                                    Estimate   Std. Error t value     Pr(>|t|)
(Intercept)                                   -0.9019      3.8627    -0.233     0.8167
log10(employment)                     1.5369      0.1912       8.038     1.50e-09 ***
log10(real84_avg_ann_wages)   1.8729      0.2094        8.945     1.12e-10 ***
log10(housing_units)                  -2.4115     0.9161       -2.632     0.0124 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03386 on 36 degrees of freedom
Multiple R-squared: 0.9512,     Adjusted R-squared: 0.9471
F-statistic: 233.7 on 3 and 36 DF, p-value: < 2.2e-16

Now, lets compare the coefficient values of the R model to Mr. Fischer's model, including the transformed Intercept coefficient:

                                             R Model                  Fischer Model
Employment                           1.5369                     1.54156
Per Capita Wages                   1.8729                     1.89709
Housing Units                        -2.4115                   -2.52214
Intercept                                 -0.9019                    2.383505

Once again, the coefficients are very similar, except that our intercept is ~ 0.85 standard errors lower than Mr. Fischer's model. But this is still very similar. We can now conclude that we have reproduced substantially equivalent results for the "Housing Inventory" model using the same data.

We now turn our attention to evaluating which of these two models is the better fit.

2. Determining Mr. Fischer's "Best Fit" Model

Two commonly accepted ways to compare a simpler model to a more complex one are Likelihood ratio testing, and AIC comparisons (Zuur et. al. 2009). A very convenient way to do it in R is by refitting the linear models in using gls (generalized least squares) in the nlme package and then comparing them using the anova command. The model coefficients are exactly the same using either the lm or gls fit. Here is the code:

library(nlme)
rentmodel_simple <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
, data=SF_75to14, method = "ML" )
rentmodel_inventory <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
+ log10(housing_units), data=SF_75to14, method = "ML" )
anova( rentmodel_simple, rentmodel_inventory)

NERD NOTE: Strictly speaking, the gls models both need to be fit using the maximum likelihood ("ML") method rather than the default restricted maximum likelihood ("REML") method for the Likelihood ratio test result to be valid when comparing predictor variables. (Zuur et al. 2009)

And the results:

                               Model df       AIC          BIC         logLik      Test     L.Ratio    p-value
rentmodel_simple         1       4 -146.5009 -139.7454 77.25047
rentmodel_inventory    2       5 -151.5418 -143.0974 80.77092 1 vs 2 7.040897   0.008

The significant Likelihood ratio test result tells us that the simpler model has significantly less explanatory power than the more complex "Housing Inventory" model even after accounting for the number of explanatory variables. Similarly, the more complex "Housing Inventory" model has a AIC value that is more than 2 units lower than the simpler model, indicating that the "Housing Inventory" model has more predictive value. (Remember, larger negative numbers are lower than smaller negative numbers.) We can conclude that the "Housing Inventory" model is a better fit than the "Simple" model so long as we have not violated any model assumptions. However, as we are about to see, there is at least one significant assumption being violated.

3. Addressing Violated Model Assumptions.

A very important assumption of linear regression is that each observation of the response variable is independent of every other observation of the response variable. In this case, that would require us to believe that the median rent observed in any given year has no relationship to the median rent from the year before. Such a belief would be utterly absurd! Therefore, we can not trust the statistical significance of any of our prior analysis until we correct for the relationships (auto-correlation) in the data.

Furthermore, there is another problem. What if some other, not yet identified, but more or less constant factor is contributing to the steady rise in rents, such as steadily escalating property taxes, land values, or decreasing household sizes? How do we differentiate these unspecified general long-term trends from the specific predictor variables we are examining? Typically the solution is to include a time variable in the model to reflect unexplained trends in the data. In our case, that is the year variable.

To account for auto-correlation and any possible unexplained trends, we will add "year" to the predictor variables, test to see if auto-correlation is present, and if it is present, and then add an auto-correlation model to the variance structure of the model. The general model fitting procedure follows the protocol in Chapter 5 of Zuur et al. (2009) and the auto-correlation model selection procedure follows that in Chapter 6 of Zuur et al. (2009).

Step 1. Specify FULL ("Better than Optimal") Model

rentmodel_FULL <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_Avg_wages)
+ log10(housing_units) + year, data=SF_75to14, method = "REML" )

This is the model with all possible predictors (including year), regardless of whether they are statistically significant. Note that this is fit using "REML" for the purpose of examining of comparing auto-correlation variance structures.

Step 2. Visually Test for Auto-Correlation

We now generate an graph of the Auto-Correlation Function (ACF) of the

E <- residuals(rentmodel_FULL, type = "normalized")
acf(E, na.action = na.pass, main = "Auto-correlation plot for residuals")

The results are this plot:

This graph shows the correlation of the residuals of the model at different lags between years. The dotted line shows the generally accepted significance threshold. There is an apparent cyclical pattern, and several of the lags are above the significance threshold. This is evidence of possible auto-correlation and thus a violation of the independence assumption of linear regression. We therefore must try adding an autocorrelation function to the variance structure of the model.

Step 3. Add Time Auto-Correlation to Model and Test

We now create a model with a simple, 1 year auto-correlation structure, and compare it to the FULL model:

rentmodel_AR1 <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_Avg_wages)
+ log10(housing_units) + year, data=SF_75to14, method = "REML",
correlation = corAR1(form =~ year) )
anova(rentmodel_FULL , rentmodel_AR1)

The results:

                            Model df       AIC          BIC           logLik     Test     L.Ratio    p-value
rentmodel_FULL     1      6 -132.4246   -123.0926   72.21232
rentmodel_AR1        2     7 -138.1264   -127.2390   76.06320 1 vs 2 7.701762   0.0055

Adding auto-correlation significantly improves the fit of the model and also lowers the AIC by almost 6 units. This confirms that we must include auto-correlation in the model. Further tests for additional 2 or more years auto-correlation structures (not shown here) did not improve fit. We have found the optimal variance structure.

Step 4. Backwards Selection of Predictor Variables To Find Optimal Model

Now that we have corrected for the time-series nature of our data, we re-fit the AR1 model using maximum likelihood, examine the parameters for the one with least significance, and test to see whether we can eliminate it from the model.

rentmodel_AR1_full <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_Avg_wages)
+ log10(housing_units) + year, data=SF_75to14, method = "ML",
correlation = corAR1(form =~ year) )
summary(rentmodel_AR1_full)

Model results (parts omitted):

Generalized least squares fit by maximum likelihood
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages) +      log10(housing_units) + year
Data: SF_75to14
       AIC       BIC logLik
-156.754 -144.9319 85.377

Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
      Phi
0.3210725

Coefficients:
                                                  Value          Std.Error     t-value      p-value
(Intercept)                         4.299450 5.531693    0.777240 0.4422
log10(employment)                    1.236149 0.248592 4.972610 0.0000
log10(real84_avg_ann_wages) 1.566711   0.293898 5.330801   0.0000
log10(housing_units)                 -6.284575 2.490416 -2.523504 0.0163
year                                             0.009697 0.005350   1.812365   0.0785

Standardized residuals:
        Min                Q1                Med              Q3              Max
-2.63051220 -0.57265674 -0.04795894 0.59327777 2.90406053

Residual standard error: 0.03018757
Degrees of freedom: 40 total; 35 residual

The year predictor variable appears not to be significant (p > 0.05). We should see if we can eliminate it using the likelihood ratio test:

rentmodel_AR1_drop_year <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
+ log10(housing_units), data=SF_75to14, method = "ML",
correlation = corAR1(form =~ year) )

anova(rentmodel_AR1_full, rentmodel_AR1_drop_year)

Results:

                                        Model df    AIC       BIC       logLik   Test    L.Ratio p-value
rentmodel_AR1_full               1      7 -156.754 -144.932 85.3770
rentmodel_AR1_drop_year     2      6 -155.315 -145.182 83.6576 1 vs 2 3.4388 0.0637

Although it is close, the p-value is greater than 0.05, so removing "year" from the model does not significantly reduce the fit. The very small reduction in AIC also supports that. Year is eliminated.

Lets examine the parameters of the model without year (some parts removed for brevity):

> summary(rentmodel_AR1_drop_year)
Generalized least squares fit by maximum likelihood
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages) +      log10(housing_units)
...

Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
      Phi
0.3805847

Coefficients:
                                                  Value          Std.Error   t-value       p-value
(Intercept)                                 -1.202616   5.067569 -0.237316 0.8138
log10(employment)                   1.398091   0.243139   5.750181   0.0000
log10(real84_avg_ann_wages) 1.831434   0.273964   6.684941   0.0000
log10(housing_units)                -2.180096   1.194698 -1.824810 0.0763
...
Residual standard error: 0.03225449
Degrees of freedom: 40 total; 36 residual

Examining the p-value of the housing_units predictor variable, we see that it is not significant. So we test to see if the fit changes significantly when we drop it:

> rentmodel_AR1_drop_units <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
+                        , data=SF_75to14, method = "ML", correlation = corAR1(form =~ year) )
> anova(rentmodel_AR1_drop_year, rentmodel_AR1_drop_units)
                                           Model df   AIC          BIC           logLik       Test    L.Ratio p-value
rentmodel_AR1_drop_year    1     6 -155.3152 -145.1819 83.65759
rentmodel_AR1_drop_units   2     5 -154.2168 -145.7724 82.10842 1 vs 2 3.098339 0.0784

Removing the Housing Units variable also does not significantly reduce the fit, nor does it raise the AIC very much. This is important. LET ME REPEAT: We have just found that the Housing Units variable IS NOT SIGNIFICANT once we correct for the auto-correlation that comes from this being a time series.

Lets examine the coefficients of new model, which is very similar to Mr. Fischer's original "simple" model, to make sure all the remaining variables are significant.

> summary(rentmodel_AR1_drop_units)
Generalized least squares fit by maximum likelihood
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages)
Data: SF_75to14
...
Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
      Phi
0.4759222

Coefficients:
                                                  Value         Std.Error    t-value       p-value
(Intercept)                                 -9.977596   1.3093363   -7.620346     <0.001
log10(employment)                   1.178835   0.2495880    4.723126      <0.001
log10(real84_avg_ann_wages) 1.372409   0.1150156   11.932379     <0.001
...
Residual standard error: 0.03520936
Degrees of freedom: 40 total; 37 residual

Every single remaining variable is VERY HIGHLY significant. We can not eliminate any other variables. We have now arrived at the optimal, most parsimonious model.

Step 5. Calculate Final Model Parameters Using REML

Although we have arrived at the optimal model, the final model parameters should be obtained using REML fitting (Zuur et al. 2009). They are highlighted in yellow:

> rentmodel_AR1_drop_units_REML <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
+                        , data=SF_75to14, method = "REML", correlation = corAR1(form =~ year) )
> summary(rentmodel_AR1_drop_units_REML)
Generalized least squares fit by REML
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages)
Data: SF_75to14
       AIC       BIC logLik
-143.489 -135.4344 76.7445

Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
      Phi
0.5870734

Coefficients:
                                                     Value       Std.Error        t-value    p-value
(Intercept)                                 -9.682362 1.4559104   -6.650383   <0.001
log10(employment)                   1.115140   0.2770300    4.025342   3e-04
log10(real84_avg_ann_wages) 1.388164 0.1405178    9.878924   <0.001

Correlation:
                                                    (Intr) lg10()
log10(employment)                   -0.919
log10(real84_avg_ann_wages) 0.009 -0.402

Standardized residuals:
       Min         Q1        Med         Q3        Max
-1.8031291 -0.4762140 -0.1536071 0.2795488 3.0131554

Residual standard error: 0.03999162
Degrees of freedom: 40 total; 37 residual

From this, we can now write the formula for the optimal model for predicting rents in San Fransisco. After transforming the intercept term, we get:

Rent = 0.0000000002077963 X (Employment)^(1.115) X (Per_Capita_Annual_Wages)^(1.388)

Conclusions

In conclusion, we reproduced Mr. Fischer's model fitting results using different software, and we have re-analyzed the data using standard procedures published by recognized statisticians.
Using those methods and Mr. Fischer's data, we find that median rents in San Francisco are best predicted by average per capita annual income and total employment. We further find that there IS NOT sufficient evidence that total housing unit inventory has a significant effect on the median rent in San Francisco.

It is possible that our sample size is not large enough to detect a relationship between housing unit inventory and median rent, if it exists. It is also possible that housing inventory did not change enough during the sample period to demonstrate the impact of changes in inventory on median rent. Near significant results in some tests suggest that either of these might be possibilities, and suggest that more investigation is warranted as more data become available. But in any case, the magnitude of impacts on median rent of rapidly increasing employment and rapidly increasing per capita income are much larger than the impact of building more housing units.

The policy implications of this data are limited, but the lack of a significant impact of housing inventory on median rents suggest that affordable housing advocates should concentrate their efforts on creating dedicated affordable housing rather than trying to build more total housing units.

****** UPDATE May 27 2016 ******

After some reflection and conversation with others, I feel compelled to offer one additional caveat, and to scale back my policy implication a bit.

First, I need to emphasize that what I have established here are merely correlations. Mr. Fischer, myself, or anyone else analyzing this data have not established causation! While our results make sense intuitively, it's important to take them with a grain of salt because we haven't actually established the mechanisms linking rising rents to rising income and employment. It is evidence, but it is not definitive.

Secondly, my policy implication went too far. It should be this: housing affordability advocates should be aware that increasing the total number of all housing units may not be effective in decreasing median rents, and should perhaps focus their efforts elsewhere.