Picking a Realtor — with DATA. Part 2: Analysis
This post covers the analysis of the real-estate data we obtained in Part 1. We'll use a Jupyter notebook to look at the data. The source code for this notebook is (as usual) on GitHub.
First we’ll import the packages we’ll be using:
import pandas as pd
import numpy as np
import datetime as dt
from matplotlib import pyplot as plt
import folium
from folium import plugins
import seaborn as sns
from sklearn.metrics.pairwise import euclidean_distances

pd.options.display.max_rows = 200
pd.options.display.max_columns = 100
Next we’ll import the data itself. Note that the data was obtained by searching a specific map rectangle for homes that had sold in the previous 3 years for between $2M and $3.5M. This search was run in the summer of 2020.
df = pd.read_csv('comps-2-3.5M.csv').set_index('zpid')
df.drop(columns='Unnamed: 0',inplace=True)
The data has a number of date fields in it that need to be converted to datetime:
# keep these as datetime64 so we can use the .dt accessor on date arithmetic below
df['dateSold'] = pd.to_datetime(df['dateSold'])
df['listedAt'] = pd.to_datetime(df['listedAt'])
df['yearSold'] = df['dateSold'].dt.year
Next we add some features to the data.
df['zEstimate ratio'] = df['sellingPrice']/df['zEstimate_at_listing']
df['listing ratio'] = df['sellingPrice']/df['firstListingPrice']
df['days to sell'] = (df['dateSold']-df['listedAt']).dt.days
df['price per sf'] = df['sellingPrice']/df['area']
The zEstimate ratio is the ratio of the final selling price to the last zEstimate that was published before the home was listed for sale. This ratio reflects a number of factors, including market conditions and the accuracy of Zillow's price prediction algorithm, but it also reflects the impact the listing agent has on the selling price. If a home is well presented online and in person and the realtor is a strong negotiator, then (in theory at least) the final selling price should reflect that effort.
Let’s look at realtor performance via a pivot table:
df.pivot_table(
    index='sellerAgent',
    aggfunc={
        'zipcode': 'count',
        'zEstimate ratio': np.mean,
        'listing ratio': np.mean,
        'days to sell': np.mean,
        'sellingPrice': np.mean,
        'dateSold': max,
        'price per sf': np.mean
    }).rename(columns={'zipcode': 'sales', 'dateSold': 'Most recent sale'})\
    .sort_values(by='sales', ascending=False).head(20)
Note that we first filtered on a particular geographic area then filtered by selling price to create the data set we’re working with. This data does not reflect the total performance of each realtor as they could have sold many more homes outside our area or outside our price range. We’re talking about a tiny area of less than 4 square miles, basically a few neighborhoods.
Note that even though Zillow has data on millions of homes and transactions, when you drill down to a small neighborhood the total number of transactions per realtor gets to be quite small. The consequence is that nothing is really statistically significant at the realtor level.
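To get a feel for just how noisy the per-agent averages are, here's a back-of-the-envelope sketch (not in the original notebook) using the ~19% standard deviation of the zEstimate ratio that we'll compute later in this post:
# standard error of an agent's mean zEstimate ratio, given only a few sales
sigma = 0.19  # std dev of the zEstimate ratio across all transactions
for n in [2, 3, 4, 6]:
    print(f"n={n}: standard error of the mean ~ {sigma/np.sqrt(n):.2f}")
With only 2–6 sales per agent the standard error is on the order of 8–13 percentage points, so apparent differences between agents should be taken with a healthy grain of salt.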
Two agents stand out in the list above: Laura Bertolacci and Lena Griffin. They managed to get 32% and 41% uplifts, respectively, over Zillow's zEstimate that existed just prior to listing. Laura has 4 sales within our filter criteria while Lena has 3. However, Laura sold her houses in an average of 48 days, whereas Lena took 165 days.
It’s also interesting that I recognize two of the names on this list: Sam Anagnostou (a friend of a friend) and Michelle Glaubert, who was the original listing agent on the house when we had purchased it 32 years earlier. Sam has 6 transactions but he’s only getting 97% of the zEstimate on average. Michelle has 2 and is getting 102% of the zEstimate (with a 29-day sales cycle, which is very quick). Unfortunately, Michelle hadn’t sold a house in this area and price range since March 2018.
When I was running these analyses in the summer of 2020, the COVID crisis was in full swing. Its impact on home prices was still somewhat uncertain. It certainly added a lot of friction to the selling process by eliminating open houses and by making in-person visits much more complicated. This really emphasized the need to get professional photos and a 3D walkthrough of a clean, well-staged home. For tax reasons we also had a deadline: we had to close a sale by December 31, 2020, so days to sell was almost as important a metric as selling price.
We ended up retaining Laura Bertolacci to sell our home.
Coming up with a Listing Price
Having retained Laura and gotten the home ready to sell, one of the biggest questions was the price we were going to list it at. Laura was pushing us to price it relatively low (relative to the zEstimate) in order to generate interest and potentially attract multiple offers. I decided to look at home sale data in our area to try to understand the market dynamics a bit better.
First I wanted to understand a bit more about how houses were selling in the neighborhood so I made some heat maps using the folium package:
# draw a map of the zEstimate ratio for all the transactions
# center on Sancho's Taqueria out of respect for their fish tacos
m = folium.Map([37.4710361, -122.2629159], zoom_start=13)

# the heatmap plugin expects either a 2D or 3D array with columns [lat, long(, weight)]
# it also can't handle NaNs
heatmapdata = np.array(df[['latitude', 'longitude', 'zEstimate ratio']].dropna())
m.add_child(plugins.HeatMap(heatmapdata, radius=15))
m
This yielded the following map:
This turned out not to be terribly useful at this resolution — there just weren’t enough sales to make out a pattern. The bright orange spots correspond to individual homes that sold for much higher than their pre-listing zEstimate.
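One alternative (a sketch, not from the original notebook) is to plot each sale as its own circle, sized by its zEstimate ratio, so individual outliers stay visible instead of being smeared into a heatmap:
# plot each transaction as a circle sized by its zEstimate ratio
m2 = folium.Map([37.4710361, -122.2629159], zoom_start=13)
for _, row in df[['latitude', 'longitude', 'zEstimate ratio']].dropna().iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5 * row['zEstimate ratio'],  # bigger circle = bigger uplift
        fill=True
    ).add_to(m2)
m2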
Time for some charts! First the zEstimate Ratio as a function of selling price:
df.plot.scatter(x='sellingPrice',y='zEstimate ratio')
This appears reasonably random, except of course that we’ve filtered the homes to those that sold between $2M and $3.5M. It does say quite a bit about the quality of Zillow’s estimates, however. Generally the zEstimate ratio is >1 (mean = 1.06), which means Zillow tended to underestimate the value of homes in this area between 2017 and 2020. There are some sizable outliers at 0.5 and 2.1. The standard deviation was 19%, which is more than I expected. In fairness to Zillow’s algorithm, there could be substantial errors in the underlying data on a home on Zillow before the listing agent makes corrections when they list the home.
Zillow reports that their median error (not quite the same as the standard deviation of the error) is in the 2% range; however, they don’t specify when this accuracy is measured relative to the selling date. Here I used the last zEstimate before the home was listed. I suspect that Zillow is using the last zEstimate before the home is sold, by which time they would have benefitted from the entire history of the listing price.
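For an apples-to-apples comparison with Zillow's published figure, we can compute the median absolute error of the pre-listing zEstimate on our own little data set (a quick sketch, not in the original notebook):
# |zEstimate - sellingPrice| / sellingPrice, i.e. error relative to the sale price
median_abs_err = (1/df['zEstimate ratio'] - 1).abs().median()
print(f"median absolute error: {median_abs_err:.1%}")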
Investopedia also has an interesting post about the accuracy of the zEstimate.
Our map includes 3 zip codes, so let’s look at the zEstimate ratio by zip:
for zc in df.zipcode.unique():
    sns.distplot(df.loc[df.zipcode == zc, "zEstimate ratio"], label=zc, hist=False)
plt.legend();
There’s not a lot of difference in these distributions except that homes appear to sell under the zEstimate a bit more in our zip (94062).
Next we looked at the listing ratio, i.e. the ratio of the final selling price to the original listing price. Less than one means the home had to be discounted to sell, greater than one likely means there were multiple competing offers above the list price. Let’s look at listing ratio vs. selling price:
df.plot.scatter(x='sellingPrice',y='listing ratio')
Again, pretty random, but > 1 on average.
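A quick way to put numbers on that impression (a small sketch):
print(df['listing ratio'].describe())
print(f"share of homes sold above list: {(df['listing ratio'] > 1).mean():.0%}")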
A quasi-normalized measure of home prices is price per square foot (for my international readers a square foot is roughly 1/10th of a square meter). Let’s look at how selling prices varied by living area:
df.plot.scatter(x='area',y='sellingPrice')
The home we were selling was 2963 sf (275 m²). How did that compare with the other homes in the neighborhood?
sns.distplot(df['area'])
Our home was just slightly larger than the average home in the target price range.
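A quick numeric check of that claim (a sketch, not in the original notebook):
print(f"mean living area: {df['area'].mean():.0f} sf")
print(f"share of homes smaller than ours: {(df['area'] < 2963).mean():.0%}")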
Let’s look at the distribution of price per square foot. Here we had to remove some outliers (possibly tiny homes on large lots):
sns.distplot(df.loc[df['price per sf']<2000,'price per sf'])
With COVID raging I wanted to look at how home prices had evolved over time:
for y in df['yearSold'].unique():
    sns.distplot(
        df.loc[(df.yearSold == y) &
               (df['price per sf'] < 2000), 'price per sf'],
        label=y, hist=False)
plt.legend();
Amazingly, 2020 was looking pretty good (little did we know then how much more amazing 2021 was going to be).
Newer homes tend to sell for more dollars per square foot — all other factors being equal. While our home was originally built in 1979 it had been so extensively remodeled and added on to over the years we owned it that it bore almost no resemblance to the original house. The most recent, and most extensive remodel was done in 2008 so I had put that in our Zillow entry to see how that impacted their prediction.
df.plot.scatter(x='yearBuilt',y='price per sf');
plt.xlim(1900,2020)
plt.ylim(0,2000);
This chart didn’t show the relationship I was expecting but I knew that older houses tended to be smaller, yet on larger lots so that would inflate their price per square foot.
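One way to check that hunch is a quick correlation matrix (a sketch, filtering out the bogus -1 yearBuilt values mentioned below):
# do older homes tend to be smaller but sit on larger lots?
df.loc[df['yearBuilt'] > 1900,
       ['yearBuilt', 'area', 'lotSize', 'price per sf']].corr()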
It’s also clear that our neighborhood was mostly built in the post-war years with another growth spurt in the 80s. This is more obvious in a histogram:
sns.distplot(df.loc[df['yearBuilt']>1900,'yearBuilt']);
Again, we were on a schedule so I wanted to know what to expect in terms of time to sell:
sns.distplot(df['days to sell']);
While some homes had stayed on the market for almost a year most were selling in 20–100 days.
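The quantiles make that concrete (a quick sketch):
df['days to sell'].quantile([0.25, 0.5, 0.75, 0.9])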
Finally I wanted to look at comparables. When a home’s value is appraised the appraiser will always look at comparable homes in the neighborhood to bracket the possible values of the home being appraised.
I decided to estimate similarity using 5 specific features:
- number of bedrooms
- number of baths
- living area
- the year the house was built
- lot size
Zillow had returned -1 values for some of these (their version of a NaN?) so I had to filter them out. Then it was a matter of adding our home to the data (it hadn’t sold in the past 3 years so it wasn’t there), normalizing the data for the 5 selected features, then using the euclidean_distances function from the sklearn package to compute the Euclidean distance between our home and all the other (sold) homes in the data set.
df_norm = df[['beds','baths','area','yearBuilt','lotSize']].copy().dropna()
df_norm.drop(
    df_norm.loc[(df_norm.yearBuilt == -1) |
                (df_norm.lotSize == -1)].index, inplace=True)

my_house = {
    'beds': 4,
    'baths': 3.5,
    'area': 3000,
    'yearBuilt': 2008,
    'lotSize': 9900
}
# append our (unsold) home so it gets normalized along with the comps
df_norm = df_norm.append(pd.DataFrame(my_house, index=[0]))

# normalize the data
for col in df_norm.columns:
    mx = df_norm[col].max()
    mn = df_norm[col].min()
    df_norm[col] = (df_norm[col] - mn)/(mx - mn)

# convert the dataframe to an array
array_norm = np.array(df_norm)

# now pull our house off the end of the array and dataframe
my_house_norm = array_norm[-1:]
array_norm = array_norm[:-1]
df_norm = df_norm[:-1]

# compute the euclidean distances from all the houses to my house
df_norm['dist'] = euclidean_distances(array_norm, my_house_norm).ravel()
top5_neighbors = df_norm.sort_values(by='dist').head()
Then we could look at the comparables with:
df.iloc[df.index.isin(top5_neighbors.index)]
Which gave us 5 comparable homes that had sold in the past 3 years.
Just focusing on the features used in the matching plus the selling price gives us:
df.iloc[df.index.isin(top5_neighbors.index)]\
[['address','beds','baths','area',
'yearBuilt','lotSize','sellingPrice']]
Epilogue
Laura did a great job of getting the property ready for sale, coordinating painters, landscapers, the photographer, and the furniture staging. However, we were selling in the fall of 2020 when the market was a bit slack, so we had to drop the price a couple of times and ended up about 12% under the zEstimate that existed before the home was listed. On the other hand, she sold the house in less than 66 days, a bit above her average but well before our end-of-2020 deadline. A confounding factor was that we had previously overestimated the square footage of the house on Zillow by about 100 square feet. Laura corrected this error before listing, but at roughly $1,000 per square foot that contributed about $100k to the delta between the original zEstimate and the final selling price.
With perfect hindsight we should have waited until the spring of 2021 to sell. Market conditions are much better now, more than enough to make up for the tax hit we would have taken, and we would have benefitted from an additional 6 months of rent payments from our tenants. Then again, last year was full of doom and gloom about people moving out of San Francisco and Silicon Valley to work remotely, and no COVID vaccine had yet been approved, so prices could just as easily have cratered this year.