Introduction
Eric Fischer over at Experimental Geography recently posted an intriguing analysis of San Francisco rent data last week. In the post he attempts to answer the question of whether building more rental units in San Francisco will lower rents using an extremely impressive compilation of advertised rents, total wages, total employment, and total housing units in San Francisco between 1975 and 2014, and then builds a statistical model to predict rent prices from per capita wages, total employment and number of building units. From the model, Mr. Fischer concludes that rising wages and employment coupled with very few new housing units being built are responsible for rents rising faster than inflation, but that building more housing units (around 1.5% a year) would help stabilize rents in real terms.I truly admire the dogged and obsessive work it took to compile this data set, as well as the guts to try to answer an important policy question using actual data. And Mr. Fischer's conclusion does make intuitive sense.
However, being a skeptic and a stickler for good statistical procedure, I was concerned that Mr. Fischer does not display any goodness of fit statistics beyond coefficient of determination (R^2) values. As statisticians have warned for decades, a high R^2 value by itself does not indicate a statistically significant regression model. Similarly, an increased R^2 value resulting from adding more explanatory variables does not mean the added variables significantly improved the model; it can also mean that the model is becoming over-fit. Statistical text books and references of all stripes agree that additional statistical tests are required to judge goodness of fit when adding variables to a regression model, although they sometimes disagree about the correct methods.
Luckily, Mr. Fischer has been gracious and open enough to post his raw data on GitHub, so that stats geeks like me can verify and improve the analysis. I applaud this, given Mr. Fischer's self professed lack of statistical expertise and the ongoing replication crisis in science. Given this opportunity, I therefore set out to do the following using the R statistical package and Mr. Fischer's data:
- Independently reproduce Mr. Fischer's reported modeling results.
- Conduct tests to determine which of Mr. Fischer's models (if any) are the "best fit" model.
- Re-analyze the data to account for the time-series nature of the data.
Mixed Effects Models and Extensions in Ecology with R. 2009. Zuur, Ieno, Walker, Saveliev and Smith. Springer
Ecological Models and Data in R. 2008. Benjamin M. Bolker. Princeton.
1. Reproducing the Results
The data set Mr. Fischer compiled consists of the following columns: Year, median rent, housing units, net new housing, employment, total wages, and CPI. The CPI is in terms of 1984 dollars. While housing unit counts go back to 1906 and median rents go back to 1948, employment and total wages only go back to 1975.From reading his post carefully, it appears the analysis proceeded like this:
- Filter data to include only 1975-2014.
- Adjust median rent and total wages for inflation to real 1984 dollars by dividing them by the CPI.
- Calculate Annual Per Capita Wages by dividing Total Wages by Employment
# Get data set using file chooser dialogue boxIn his post, Mr. Fischer modeled median rent two different ways. First, he modeled median rent as a function of per capita wages and employment, which we will call the "simple" model. Second, he modeled median rent a function of per capita wages, employment, AND total housing inventory, which we will call the "Housing Inventory" model.
SF <- read.csv(file.choose(), blank.lines.skip = TRUE )
# Get the time series with all complete data - 1975 to 2014
SF_gt1974 <- subset(SF, year > 1974 )
SF_75to14 <- subset(SF_gt1974, year < 2015 )
# Adjust median rent and total wages to 1984 dollars using CPI
SF_75to14$real84_median_rent <- SF_75to14$median_rent / SF_75to14$CPI
SF_75to14$real84_totalwages <- SF_75to14$total_wages / SF_75to14$CPI
# Calculate CPI-adjusted annual per capita wages
SF_75to14$real84_avg_ann_wages <- (SF_75to14$real84_totalwages / SF_75to14$employment)
In both cases, he used a log-log model, which is to say that he took the logarithm (presumably base 10) of the independent variable and all dependent variables in order to transform a non-linear multiplicative model into a linear model. While the justification for doing so can be questioned, we avoid that for now and focus on reproducing his results.
1A. Reproducing the "Simple" Model
The following code fits the data to the simple model:rentmodel_simple <-lm(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages),Here are the results:
data=SF_75to14)
summary(rentmodel_simple)
Call:Four things are apparent from these results. First, the F-statistic for the overall fit is highly significant. Secondly, the coefficient of determination (R^2) is ~0.94, which is similar to the 0.92 Mr. Fischer reported in the comments to his post. Third, the employment and Avg_wages coefficients (which are the same as the exponents in the multiplicative model) are very similar to the exponents reported on the graph of the simple model. Fourth, the coefficients on all terms are significantly different from zero. This is strong evidence that this model fit is very similar to Fischer's simple model. But what about the intercept coefficient?
lm(formula = log10(real84_median_rent) ~ log10(employment) +
log10(real84_avg_ann_wages), data = SF_75to14)
Residuals:
Min 1Q Median 3Q Max
-0.064949 -0.023399 -0.005385 0.014964 0.116402
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.77231 0.99852 -10.788 5.57e-13 ***
log10(employment) 1.33400 0.18848 7.078 2.23e-08 ***
log10(real84_avg_ann_wages) 1.35093 0.07236 18.669 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03647 on 37 degrees of freedom
Multiple R-squared: 0.9418, Adjusted R-squared: 0.9386
F-statistic: 299.1 on 2 and 37 DF, p-value: < 2.2e-16
Mr. Fischer reports a value of 2.03118e-11 as a "static" multiplier on the graph of his simple model. In order to compare this to our intercept coefficient, we need to transform it to the same "log10" scale as is used in fitting the R model. Log10(2.03118e-11) = -10.69225. While not the same, this is very close to our intercept coefficient. Not knowing the details of the fitting algorithm Mr. Fischer used, we conclude that we have reproduced substantially equivalent results for the simple model using the same data. Whether or not this is the best fit model and whether or not model assumptions have been violated will be addressed later.
1B. Reproducing the "Housing Inventory" Model
The following code fits the data to the "Housing Inventory" model:rentmodel_inventory <-lm(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)The results:
+log10(housing_units), data=SF_75to14)
summary(rentmodel_inventory)
Call:Now, lets compare the coefficient values of the R model to Mr. Fischer's model, including the transformed Intercept coefficient:
lm(formula = log10(real84_median_rent) ~ log10(employment) +
log10(real84_avg_ann_wages) + log10(housing_units), data = SF_75to14)
Residuals:
Min 1Q Median 3Q Max
-0.067660 -0.018309 -0.000776 0.012929 0.103568
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.9019 3.8627 -0.233 0.8167
log10(employment) 1.5369 0.1912 8.038 1.50e-09 ***
log10(real84_avg_ann_wages) 1.8729 0.2094 8.945 1.12e-10 ***
log10(housing_units) -2.4115 0.9161 -2.632 0.0124 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03386 on 36 degrees of freedom
Multiple R-squared: 0.9512, Adjusted R-squared: 0.9471
F-statistic: 233.7 on 3 and 36 DF, p-value: < 2.2e-16
R Model Fischer ModelOnce again, the coefficients are very similar, except that our intercept is ~ 0.85 standard errors lower than Mr. Fischer's model. But this is still very similar. We can now conclude that we have reproduced substantially equivalent results for the "Housing Inventory" model using the same data.
Employment 1.5369 1.54156
Per Capita Wages 1.8729 1.89709
Housing Units -2.4115 -2.52214
Intercept -0.9019 2.383505
We now turn our attention to evaluating which of these two models is the better fit.
2. Determining Mr. Fischer's "Best Fit" Model
Two commonly accepted ways to compare a simpler model to a more complex one are Likelihood ratio testing, and AIC comparisons (Zuur et. al. 2009). A very convenient way to do it in R is by refitting the linear models in using gls (generalized least squares) in the nlme package and then comparing them using the anova command. The model coefficients are exactly the same using either the lm or gls fit. Here is the code:library(nlme)NERD NOTE: Strictly speaking, the gls models both need to be fit using the maximum likelihood ("ML") method rather than the default restricted maximum likelihood ("REML") method for the Likelihood ratio test result to be valid when comparing predictor variables. (Zuur et al. 2009)
rentmodel_simple <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
, data=SF_75to14, method = "ML" )
rentmodel_inventory <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)
+ log10(housing_units), data=SF_75to14, method = "ML" )
anova( rentmodel_simple, rentmodel_inventory)
And the results:
Model df AIC BIC logLik Test L.Ratio p-valueThe significant Likelihood ratio test result tells us that the simpler model has significantly less explanatory power than the more complex "Housing Inventory" model even after accounting for the number of explanatory variables. Similarly, the more complex "Housing Inventory" model has a AIC value that is more than 2 units lower than the simpler model, indicating that the "Housing Inventory" model has more predictive value. (Remember, larger negative numbers are lower than smaller negative numbers.) We can conclude that the "Housing Inventory" model is a better fit than the "Simple" model so long as we have not violated any model assumptions. However, as we are about to see, there is at least one significant assumption being violated.
rentmodel_simple 1 4 -146.5009 -139.7454 77.25047
rentmodel_inventory 2 5 -151.5418 -143.0974 80.77092 1 vs 2 7.040897 0.008
3. Addressing Violated Model Assumptions.
A very important assumption of linear regression is that each observation of the response variable is independent of every other observation of the response variable. In this case, that would require us to believe that the median rent observed in any given year has no relationship to the median rent from the year before. Such a belief would be utterly absurd! Therefore, we can not trust the statistical significance of any of our prior analysis until we correct for the relationships (auto-correlation) in the data.Furthermore, there is another problem. What if some other, not yet identified, but more or less constant factor is contributing to the steady rise in rents, such as steadily escalating property taxes, land values, or decreasing household sizes? How do we differentiate these unspecified general long-term trends from the specific predictor variables we are examining? Typically the solution is to include a time variable in the model to reflect unexplained trends in the data. In our case, that is the year variable.
To account for auto-correlation and any possible unexplained trends, we will add "year" to the predictor variables, test to see if auto-correlation is present, and if it is present, and then add an auto-correlation model to the variance structure of the model. The general model fitting procedure follows the protocol in Chapter 5 of Zuur et al. (2009) and the auto-correlation model selection procedure follows that in Chapter 6 of Zuur et al. (2009).
Step 1. Specify FULL ("Better than Optimal") Model
rentmodel_FULL <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_Avg_wages)This is the model with all possible predictors (including year), regardless of whether they are statistically significant. Note that this is fit using "REML" for the purpose of examining of comparing auto-correlation variance structures.
+ log10(housing_units) + year, data=SF_75to14, method = "REML" )
Step 2. Visually Test for Auto-Correlation
We now generate an graph of the Auto-Correlation Function (ACF) of theE <- residuals(rentmodel_FULL, type = "normalized")The results are this plot:
acf(E, na.action = na.pass, main = "Auto-correlation plot for residuals")
This graph shows the correlation of the residuals of the model at different lags between years. The dotted line shows the generally accepted significance threshold. There is an apparent cyclical pattern, and several of the lags are above the significance threshold. This is evidence of possible auto-correlation and thus a violation of the independence assumption of linear regression. We therefore must try adding an autocorrelation function to the variance structure of the model.
Step 3. Add Time Auto-Correlation to Model and Test
We now create a model with a simple, 1 year auto-correlation structure, and compare it to the FULL model:rentmodel_AR1 <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_Avg_wages)The results:
+ log10(housing_units) + year, data=SF_75to14, method = "REML",
correlation = corAR1(form =~ year) )
anova(rentmodel_FULL , rentmodel_AR1)
Model df AIC BIC logLik Test L.Ratio p-valueAdding auto-correlation significantly improves the fit of the model and also lowers the AIC by almost 6 units. This confirms that we must include auto-correlation in the model. Further tests for additional 2 or more years auto-correlation structures (not shown here) did not improve fit. We have found the optimal variance structure.
rentmodel_FULL 1 6 -132.4246 -123.0926 72.21232
rentmodel_AR1 2 7 -138.1264 -127.2390 76.06320 1 vs 2 7.701762 0.0055
Step 4. Backwards Selection of Predictor Variables To Find Optimal Model
Now that we have corrected for the time-series nature of our data, we re-fit the AR1 model using maximum likelihood, examine the parameters for the one with least significance, and test to see whether we can eliminate it from the model.rentmodel_AR1_full <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_Avg_wages)Model results (parts omitted):
+ log10(housing_units) + year, data=SF_75to14, method = "ML",
correlation = corAR1(form =~ year) )
summary(rentmodel_AR1_full)
Generalized least squares fit by maximum likelihoodThe year predictor variable appears not to be significant (p > 0.05). We should see if we can eliminate it using the likelihood ratio test:
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages) + log10(housing_units) + year
Data: SF_75to14
AIC BIC logLik
-156.754 -144.9319 85.377
Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
Phi
0.3210725
Coefficients:
Value Std.Error t-value p-value
(Intercept) 4.299450 5.531693 0.777240 0.4422
log10(employment) 1.236149 0.248592 4.972610 0.0000
log10(real84_avg_ann_wages) 1.566711 0.293898 5.330801 0.0000
log10(housing_units) -6.284575 2.490416 -2.523504 0.0163
year 0.009697 0.005350 1.812365 0.0785
Standardized residuals:
Min Q1 Med Q3 Max
-2.63051220 -0.57265674 -0.04795894 0.59327777 2.90406053
Residual standard error: 0.03018757
Degrees of freedom: 40 total; 35 residual
rentmodel_AR1_drop_year <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)Results:
+ log10(housing_units), data=SF_75to14, method = "ML",
correlation = corAR1(form =~ year) )
anova(rentmodel_AR1_full, rentmodel_AR1_drop_year)
Model df AIC BIC logLik Test L.Ratio p-valueAlthough it is close, the p-value is greater than 0.05, so removing "year" from the model does not significantly reduce the fit. The very small reduction in AIC also supports that. Year is eliminated.
rentmodel_AR1_full 1 7 -156.754 -144.932 85.3770
rentmodel_AR1_drop_year 2 6 -155.315 -145.182 83.6576 1 vs 2 3.4388 0.0637
Lets examine the parameters of the model without year (some parts removed for brevity):
> summary(rentmodel_AR1_drop_year)Examining the p-value of the housing_units predictor variable, we see that it is not significant. So we test to see if the fit changes significantly when we drop it:
Generalized least squares fit by maximum likelihood
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages) + log10(housing_units)
...
Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
Phi
0.3805847
Coefficients:
Value Std.Error t-value p-value
(Intercept) -1.202616 5.067569 -0.237316 0.8138
log10(employment) 1.398091 0.243139 5.750181 0.0000
log10(real84_avg_ann_wages) 1.831434 0.273964 6.684941 0.0000
log10(housing_units) -2.180096 1.194698 -1.824810 0.0763
...
Residual standard error: 0.03225449
Degrees of freedom: 40 total; 36 residual
> rentmodel_AR1_drop_units <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)Removing the Housing Units variable also does not significantly reduce the fit, nor does it raise the AIC very much. This is important. LET ME REPEAT: We have just found that the Housing Units variable IS NOT SIGNIFICANT once we correct for the auto-correlation that comes from this being a time series.
+ , data=SF_75to14, method = "ML", correlation = corAR1(form =~ year) )
> anova(rentmodel_AR1_drop_year, rentmodel_AR1_drop_units)
Model df AIC BIC logLik Test L.Ratio p-value
rentmodel_AR1_drop_year 1 6 -155.3152 -145.1819 83.65759
rentmodel_AR1_drop_units 2 5 -154.2168 -145.7724 82.10842 1 vs 2 3.098339 0.0784
Lets examine the coefficients of new model, which is very similar to Mr. Fischer's original "simple" model, to make sure all the remaining variables are significant.
> summary(rentmodel_AR1_drop_units)Every single remaining variable is VERY HIGHLY significant. We can not eliminate any other variables. We have now arrived at the optimal, most parsimonious model.
Generalized least squares fit by maximum likelihood
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages)
Data: SF_75to14
...
Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
Phi
0.4759222
Coefficients:
Value Std.Error t-value p-value
(Intercept) -9.977596 1.3093363 -7.620346 <0.001
log10(employment) 1.178835 0.2495880 4.723126 <0.001
log10(real84_avg_ann_wages) 1.372409 0.1150156 11.932379 <0.001
...
Residual standard error: 0.03520936
Degrees of freedom: 40 total; 37 residual
Step 5. Calculate Final Model Parameters Using REML
Although we have arrived at the optimal model, the final model parameters should be obtained using REML fitting (Zuur et al. 2009). They are highlighted in yellow:> rentmodel_AR1_drop_units_REML <- gls(log10(real84_median_rent)~log10(employment)+log10(real84_avg_ann_wages)From this, we can now write the formula for the optimal model for predicting rents in San Fransisco. After transforming the intercept term, we get:
+ , data=SF_75to14, method = "REML", correlation = corAR1(form =~ year) )
> summary(rentmodel_AR1_drop_units_REML)
Generalized least squares fit by REML
Model: log10(real84_median_rent) ~ log10(employment) + log10(real84_avg_ann_wages)
Data: SF_75to14
AIC BIC logLik
-143.489 -135.4344 76.7445
Correlation Structure: AR(1)
Formula: ~year
Parameter estimate(s):
Phi
0.5870734
Coefficients:
Value Std.Error t-value p-value
(Intercept) -9.682362 1.4559104 -6.650383 <0.001
log10(employment) 1.115140 0.2770300 4.025342 3e-04
log10(real84_avg_ann_wages) 1.388164 0.1405178 9.878924 <0.001
Correlation:
(Intr) lg10()
log10(employment) -0.919
log10(real84_avg_ann_wages) 0.009 -0.402
Standardized residuals:
Min Q1 Med Q3 Max
-1.8031291 -0.4762140 -0.1536071 0.2795488 3.0131554
Residual standard error: 0.03999162
Degrees of freedom: 40 total; 37 residual
Rent = 0.0000000002077963 X (Employment)^(1.115) X (Per_Capita_Annual_Wages)^(1.388)
Conclusions
In conclusion, we reproduced Mr. Fischer's model fitting results using different software, and we have re-analyzed the data using standard procedures published by recognized statisticians.Using those methods and Mr. Fischer's data, we find that median rents in San Francisco are best predicted by average per capita annual income and total employment. We further find that there IS NOT sufficient evidence that total housing unit inventory has a significant effect on the median rent in San Francisco.
It is possible that our sample size is not large enough to detect a relationship between housing unit inventory and median rent, if it exists. It is also possible that housing inventory did not change enough during the sample period to demonstrate the impact of changes in inventory on median rent. Near significant results in some tests suggest that either of these might be possibilities, and suggest that more investigation is warranted as more data become available. But in any case, the magnitude of impacts on median rent of rapidly increasing employment and rapidly increasing per capita income are much larger than the impact of building more housing units.
The policy implications of this data are limited, but the lack of a significant impact of housing inventory on median rents suggest that affordable housing advocates should concentrate their efforts on creating dedicated affordable housing rather than trying to build more total housing units.
****** UPDATE May 27 2016 ******
After some reflection and conversation with others, I feel compelled to offer one additional caveat, and to scale back my policy implication a bit. First, I need to emphasize that what I have established here are merely correlations. Mr. Fischer, myself, or anyone else analyzing this data have not established causation! While our results make sense intuitively, it's important to take them with a grain of salt because we haven't actually established the mechanisms linking rising rents to rising income and employment. It is evidence, but it is not definitive.
Secondly, my policy implication went too far. It should be this: housing affordability advocates should be aware that increasing the total number of all housing units may not be effective in decreasing median rents, and should perhaps focus their efforts elsewhere.