Picking a Realtor — with DATA. Part 1: Accessing Real-Estate Data with GraphQL
For the bulk of my career in Silicon Valley I lived in a home in the Redwood City hills. The house was not crazy expensive when we bought it in 1988, but three remodels and an Internet revolution later it was worth quite a bit more. We had actually moved out at the end of 2017 and leased it out when we moved to Paris for a couple of years. After returning we decided to put it on the market in the middle of the pandemic for tax reasons.
To do that we needed a realtor.
There are several ways people typically find a realtor:
- Call the one that they used last time
- Ask a friend for a reference
- Look at the glossy real estate listings that are available in almost every US neighborhood and pick one who seems to be doing a good job marketing properties
But then I thought, “I have time on my hands and a penchant for going down rabbit holes, why not use data to make a more informed decision?”
I started searching Zillow for houses in my area that had sold recently to see who had sold them. Unfortunately Zillow didn’t appear to provide any structured information on realtor performance.
Time to write a scraper, or search Google for one.
It turns out there are many examples of code for scraping Zillow, but as always happens with these kinds of projects, none were exactly tailored to the problem at hand. Furthermore, Zillow frequently changes the way its web pages get data from its internal APIs; more specifically, Zillow has been transitioning to GraphQL for several years. As a fan of GraphQL, I found this made the project even more interesting.
Zillow doesn’t publish its GraphQL schema (at least none that I could find), so I would have to figure out how their site worked to get the data I needed.
Chrome DevTools is particularly useful for examining both the contents of a web page (the HTML and the DOM) and the network traffic between the browser and the underlying site. After a bit of inspection I was able to see that:
When you perform a search on Zillow, the search results are returned in an embedded script of type application/json in the body of the page. In the inspector it looks like this:
This script object contains all the data that is used by Zillow’s ReactJS front-end to render the list of matching properties (and by properties I’m talking real-estate here, not the properties of objects). It also contains all the metadata for the search itself — i.e. the search specification.
Zillow’s search results are paginated and there is no infinite scroll. Results currently appear to be limited to 20 pages at 40 properties per page, so a single search returns at most 800 properties.
The JSON contains a fair amount of information on each house in the list, but basically only what is needed to render the search listings UI.
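If you just want to peek at that blob before writing a full scraper, a couple of lines of lxml will pull it out of a search page saved from the browser. This is a quick sketch: the script is identified by the same data-zrr-shared-data-key attribute that the full scraper uses later in this post, and search.html is simply whatever filename you saved the page as.
from lxml import html

# Quick look at the embedded search-results JSON in a saved Zillow search page.
page = html.fromstring(open('search.html').read())
blob = page.xpath('//script[@data-zrr-shared-data-key="mobileSearchPageStore"]//text()')
print(''.join(blob)[:200])  # the JSON, wrapped in an HTML comment (<!-- ... -->)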
To get what I really wanted I would have to look at the detail page for each property. That turns out to be fetched from Zillow's /graphql API. When I was working on this project last summer the Zillow web application was still sending fully formed GraphQL queries to their API, but more recently they have switched to persisted queries, so you can't see (and more importantly, can't change) the structure of the queries. However, their GraphQL API still supports some older ad-hoc GraphQL queries. Looking at the results of the persisted GraphQL queries provides a lot of information about Zillow's underlying GraphQL schema, which can then be used to fetch data on a specific property.
The beauty of looking at GraphQL data is that it is pure data without any HTML markup. This makes scraping a whole lot easier than trying to parse HTML to find specific classes and IDs that end up changing all the time.
Code
Since I planned to use Pandas for the analysis it made sense to also write the scraper in Python. The following code is all available on GitHub. (Best to get the code there rather than copying from this post, as Medium tends to muck with indentation, which is essential to Python.)
First import all the required libraries and define some small utility functions:
from lxml import html
import requests
import json
from datetime import datetime
from random import randint
from time import sleep
import pandas as pd
def clean(text):
    # clean 'json as comment' returned from zillow search
    if text:
        return ' '.join(' '.join(text).split())
    return None
def save_json(data, fname):
    with open(fname, 'w') as outfile:
        json.dump(data, outfile)
def appendScalarsFromDict(a, b):
    """
    Copies all the scalar keys (strings, ints, floats, booleans, datetimes)
    from the dictionary a to the dictionary b
    (this will override existing keys in dictionary b)
    """
    for key in a:
        if isinstance(a[key], (str, int, float, bool, datetime)):
            b[key] = a[key]
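# For illustration only (not used by the scraper): appendScalarsFromDict flattens
# just the scalar values into an accumulator dict and skips nested structures, e.g.
#   row = {}
#   appendScalarsFromDict({'zpid': 123, 'address': {'city': 'Redwood City'}}, row)
#   # row is now {'zpid': 123}; the nested 'address' dict was not copied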
def pretty_print_POST(req):
    """
    Pretty print a POST request
    Parameters:
        req    the POST request built using requests.Request and prepared with .prepare()
    Pay attention to the formatting used in this function because it is programmed
    to be pretty printed and may differ from the actual request.
    """
    print('{}\n{}\r\n{}\r\n\r\n{}'.format(
        '-----------START-----------',
        req.method + ' ' + req.url,
        '\r\n'.join('{}: {}'.format(k, v) for k, v in req.headers.items()),
        req.body,
    ))
def insert_page_number(searchUrl, pageNumber):
    """
    Parameters:
        searchUrl     the original search URL
        pageNumber    an integer in [1, totalPages]
    Returns a paginated url based on the initial searchUrl which has a default page 1
    """
    p = 'pagination%22%3A%7B%22currentPage%22%3A{}%7D%2C%22'.format(pageNumber)
    return searchUrl.replace('pagination%22%3A%7B%7D%2C%22', p)
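For context, a quick sketch (standard library only) of what that URL-encoded fragment is: it is just the pagination portion of the search JSON, so insert_page_number swaps the empty pagination object for one containing a page number. The usage line is hypothetical, assuming a searchUrl copied from the browser.
from urllib.parse import unquote

# The fragment 'pagination%22%3A%7B%7D%2C%22' decodes to 'pagination":{},"'
print(unquote('pagination%22%3A%7B%22currentPage%22%3A{}%7D%2C%22'.format(3)))
# -> pagination":{"currentPage":3},"
# Typical usage: page3_url = insert_page_number(searchUrl, 3)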
Next we’ll define the many functions we need to scrape the results, working from the top down.
def scrape(url):
    """
    Starting with a Zillow search URL, traverse each page
    and scrape the results
    Parameters:
        url    search url composed through the Zillow web UI
    Returns:
        a pandas dataframe of results
    """
    response = get_response(url)
    if not response:
        print("Failed to fetch the page, please check `response.html` to see the response received from zillow.com.")
        return None
    results, totalPages, cookies = clean_results(response)
    print('Total pages: ', totalPages)
    df = get_df_from_json(results)
    prevUrl = url
    for p in range(2, totalPages+1):
        nextUrl = insert_page_number(url, p)
        response = get_response(nextUrl, cookies=cookies, referer=prevUrl)
        results, _, cookies = clean_results(response)
        # DataFrame.append was removed in pandas 2.x; pd.concat([df, ...]) is the modern equivalent
        df = df.append(get_df_from_json(results), ignore_index=True)
        prevUrl = nextUrl
    return df
The scrape function starts with a Zillow search URL, which you can copy from the browser URL bar when you perform an interactive search on Zillow. For example:
searchUrl = "https://www.zillow.com/homes/recently_sold/house_type/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22Emerald%20Lake%20Hills%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-122.28363290193751%2C%22east%22%3A-122.24844231966212%2C%22south%22%3A37.4409911163216%2C%22north%22%3A37.48306057333994%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22pmf%22%3A%7B%22value%22%3Afalse%7D%2C%22pf%22%3A%7B%22value%22%3Afalse%7D%2C%22rs%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22con%22%3A%7B%22value%22%3Afalse%7D%2C%22mf%22%3A%7B%22value%22%3Afalse%7D%2C%22manu%22%3A%7B%22value%22%3Afalse%7D%2C%22land%22%3A%7B%22value%22%3Afalse%7D%2C%22tow%22%3A%7B%22value%22%3Afalse%7D%2C%22apa%22%3A%7B%22value%22%3Afalse%7D%2C%22apco%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A15%7D"
This URL corresponds to a search for sold single-family homes in a particular rectangle, as shown on the map below:
Adding additional search criteria will cause them to be saved in the search URL. This is nice because you don’t have to construct a search URL by hand; the Zillow UI takes care of it for you.
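If you are curious about what is actually in that URL, the searchQueryState parameter is just URL-encoded JSON. Here is a small sketch (standard library only; decode_search_query_state is just a name I am using for illustration) for decoding it so you can inspect the map bounds, filters, and pagination:
import json
from urllib.parse import urlparse, parse_qs

def decode_search_query_state(search_url):
    # Pull the searchQueryState query parameter out of a Zillow search URL
    # and decode it from URL-encoded JSON into a Python dict.
    qs = parse_qs(urlparse(search_url).query)
    return json.loads(qs['searchQueryState'][0])

# state = decode_search_query_state(searchUrl)
# state['pagination'], state['mapBounds'], state['filterState'] ...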
Our scrape function is basically a loop around three other functions. get_response does a simple HTTP GET on a URI, wrapping it with a retry in case of failure.
def get_response(uri, retries=5, cookies=None, referer=None):
    """
    Perform a GET request and return the response
    Parameters:
        uri        uri to GET
        retries    number of times to retry
    Returns: an HTTP response object
    """
    for i in range(retries):  # try up to `retries` times
        sleep(randint(5, 10))
        response = requests.get(uri, headers=get_headers(referer), cookies=cookies)
        print(response.status_code, 'GET ', uri[:uri.find('?')], '…')
        if not response.ok:
            save_to_file(response)  # for debugging
            continue  # retry
        else:
            return response
    return None
get_response returns the entire content of a Zillow search page (which is HTML) but does not resolve any internal references to other objects; that's not needed. clean_results takes the HTML from get_response, extracts the JSON with the search results, and determines the total number of pages.
def clean_results(response):
    """
    Extract the total number of pages from a Zillow search page
    along with the json object for the results and any cookies set
    Parameters:
        response    HTTP response object
    Returns:
        listResults    json of results (an array, usually up to 40 results)
        totalPages     the number of total pages returned by the search
        cookies        the cookies from the response
    """
    parser = html.fromstring(response.text)
    jblob = parser.xpath('//script[@data-zrr-shared-data-key="mobileSearchPageStore"]//text()')
    cleaned_data = clean(jblob).replace('<!--', "").replace("-->", "")
    try:
        json_data = json.loads(cleaned_data)
        save_json(json_data, 'cleaned_data.json')
        totalPages = json_data.get('cat1').get('searchList').get('totalPages')
        listResults = json_data.get('cat1').get('searchResults').get('listResults', [])
        return listResults, totalPages, response.cookies
    except ValueError:
        print("Invalid json")
        return None, None, None  # keep the 3-tuple shape expected by callers
Once we have valid JSON for the search results we need to iterate over them and grab the details for each property:
def get_df_from_json(search_results):
    """
    Iterate over the JSON containing the search results to create a dataframe
    Parameters:
        search_results    json of search results
    Returns: pandas dataframe of Zillow properties including specific fields
    """
    properties_list = []
    for i, properties in enumerate(search_results):
        data = {}
        zpid = properties.get('zpid')
        prop = post_qql_query(zpid, PriceTaxQuery)
        appendScalarsFromDict(prop, data)
        data['dateSold'] = datetime.fromtimestamp(data['dateSold']/1000)
        # promote address sub-properties
        appendScalarsFromDict(prop['address'], data)
        # extract recent sales history and flatten
        appendScalarsFromDict(get_listing_info(prop['priceHistory'], data['dateSold']), data)
        # extract zEstimate right before the listing and flatten
        if 'listedAt' in data:
            appendScalarsFromDict(get_zEstimate_at_listing(prop['homeValueChartData'], data['listedAt'], data['dateSold']), data)
        properties_list.append(data)
    return pd.DataFrame(properties_list)
In going through the search results, the only key of interest is zpid, the Zillow Property ID. That zpid is the main parameter for the PriceTaxQuery, which is expressed in GraphQL:
def PriceTaxQuery(zpid, clientVersion="home-details/6.0.11.1315.master.2fc8ca5", timePeriod="FIVE_YEARS", metricType="LOCAL_HOME_VALUES", forecast=True):
    return {
        "query": """
query PriceTaxQuery($zpid: ID!, $metricType: HomeValueChartMetricType, $timePeriod: HomeValueChartTimePeriod) {
  property(zpid: $zpid) {
    address {
      city
      community
      neighborhood
      state
      streetAddress
      subdivision
      zipcode
    }
    bathrooms
    bedrooms
    countyFIPS
    dateSold
    hdpUrl
    homeStatus
    homeValueChartData(metricType: $metricType, timePeriod: $timePeriod) {
      points {
        x
        y
      }
      name
    }
    lastSoldPrice
    latitude
    livingArea
    livingAreaUnits
    longitude
    lotSize
    parcelId
    price
    priceHistory {
      time
      price
      event
      buyerAgent {
        profileUrl
        name
      }
      sellerAgent {
        profileUrl
        name
      }
    }
    yearBuilt
    zestimate
    zpid
  }
}
""",
        "operationName": "PriceTaxQuery",
        "variables": {
            "zpid": zpid,
            "timePeriod": timePeriod,
            "metricType": metricType,
            "forecast": forecast
        },
        "clientVersion": clientVersion
    }
For those of you who may have fallen asleep, WAKE UP! This is the interesting part. The PriceTaxQuery function returns a request object which is going to be POSTed to Zillow’s /graphql API. It’s named PriceTaxQuery because that’s one of the queries for which Zillow’s API still accepts the GraphQL query expressed directly as text, instead of referring to a secret query on their server by name. The query above was created by looking at the results of several GraphQL queries being run on a Zillow property detail page.
Note that if you make even a single error in the above GraphQL, Zillow will respond with a 400 without any additional information as to why the query failed. It’s easy enough to remove keys from the schema, but adding them means knowing the exact spelling of their names (and GraphQL is case-sensitive) as well as the way they are nested in the schema. Also note that certain keys may exist in multiple places; for example, city and state exist both at the root level of a property as well as under the address level.
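As a small illustration of that last point, both of the selections below refer to the same data at different levels of nesting:
property(zpid: $zpid) {
  city           # root-level copy
  state
  address {
    city         # the same data, nested under address
    state
  }
}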
The specific information I was after was the name of the realtor who sold the property, as well as key price information around the sale: what was the zEstimate (Zillow’s predicted price for a property), what was the listing price, did the listing price change over time, was the house listed more than once before being sold, and how long was the house on the market. Much of this information comes from the priceHistory part of the schema. The history of the zEstimate comes from the homeValueChartData part of the schema, which is what Zillow uses to draw the price history chart on a property detail page.
post_qql_query is a function that runs a particular query against Zillow’s /graphql API. It just uses the requests library to do this. For convenience it strips out the top two levels of the JSON hierarchy that is returned, since we don’t need them.
def post_qql_query(zpid, query, gql_server='https://www.zillow.com/graphql/'):
    """
    Run the PriceTaxQuery query against Zillow's GraphQL server
    Parameters:
        zpid    Zillow property ID
    Returns: json object including all the data requested by the query
        while omitting the top two levels of hierarchy (data.property)
    """
    sleep(randint(1, 4))  # act human
    r = requests.post(
        gql_server,
        json=query(zpid),
        headers=get_gql_headers(zpid),
        params={'zpid': zpid}
    )
    print(r.status_code, 'POST ', gql_server, zpid)
    if r.ok:
        try:
            return r.json().get('data').get('property')
        except:
            print('Error getting json from graphql query results')
            return None
    else:
        printErrors(r)
        return None
We need to send some reasonable headers with our query, so we mimic what the browser sends when Zillow’s web app makes the same request:
def get_gql_headers(zpid):
    """
    Return the GraphQL headers required to query Zillow's GraphQL server
    Parameters:
        zpid    the Zillow property ID
    Returns: a JSON POST header object
    """
    # copied from a GraphQL request to zillow using Chrome
    return {
        'authority': 'www.zillow.com',
        'method': 'POST',
        'path': '/graphql/?zpid={}'.format(zpid),
        'scheme': 'https',
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en',
        'content-type': 'text/plain',
        'dnt': '1',
        'origin': 'https://www.zillow.com',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
    }
After running the GraphQL query for a specific property we first copy all the scalar keys from the JSON (strings, ints, floats, booleans, and dates), then we use a couple of other functions to summarize the price history.
get_listing_info looks at the priceHistory object to summarize who the seller’s agent and the buyer’s agent were, the date when the home was first listed, the price at that first listing, and the number of times it was re-listed. Fortunately Zillow returns this array in reverse chronological order.
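To make the loop below easier to follow, here is the shape of a single priceHistory entry as requested by PriceTaxQuery (field names come from the query above; the values are invented for illustration):
# Illustrative only: field names match PriceTaxQuery, values are made up.
example_price_event = {
    'time': 1602547200000,    # epoch milliseconds, hence the /1000 below
    'price': 2500000,
    'event': 'Sold',          # the code also looks for 'Listed for sale'
    'sellerAgent': {'name': 'Jane Doe', 'profileUrl': '/profile/janedoe/'},
    'buyerAgent': {'name': 'John Roe', 'profileUrl': '/profile/johnroe/'},
}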
def get_listing_info(priceHistory, dateSold):
    """
    Get the price history for the property and extract key pieces of data
    Parameters:
        priceHistory    array of objects containing pricing events
    Returns an object with keys:
        sellerAgent
        sellerAgentUrl
        listedAt
        firstListingPrice
    """
    sellerAgent = sellerAgentUrl = buyerAgent = buyerAgentUrl = listedAt = firstListingPrice = None
    foundSoldEvent = False
    timesListed = 0
    # loop through the price history until a sold event is found (typically the first event)
    for ev in priceHistory:
        eventTime = datetime.fromtimestamp(ev.get('time')/1000)
        if (not foundSoldEvent) & (ev.get('event') == 'Sold'):
            foundSoldEvent = True
            # get the seller's agent info
            sa = ev.get('sellerAgent')
            if sa:
                sellerAgent = sa.get('name')
                sellerAgentUrl = 'https://www.zillow.com' + sa.get('profileUrl')
            # get the buyer's agent info
            ba = ev.get('buyerAgent')
            if ba:
                buyerAgent = ba.get('name')
                buyerAgentUrl = 'https://www.zillow.com' + ba.get('profileUrl')
        if foundSoldEvent & (ev.get('event') == 'Listed for sale'):
            # keep looking back at listing events for up to 9 months to try to find the earliest one matching the sale
            if (dateSold - eventTime).days < 9*30:
                timesListed += 1
                listedAt = eventTime
                firstListingPrice = ev.get('price')
    return {
        "sellerAgent": sellerAgent,
        "sellerAgentUrl": sellerAgentUrl,
        "buyerAgent": buyerAgent,
        "buyerAgentUrl": buyerAgentUrl,
        "listedAt": listedAt,
        "firstListingPrice": firstListingPrice,
        "timesListed": timesListed
    }
It turns out that Zillow’s zEstimate changes quite a bit when a home is listed for sale. Zillow appears to interpret this event as a strong pricing signal. Just prior to sale however, Zillow doesn’t really “know” that a home is about to be listed so its estimate of price at that point “should” only reflect market factors, what is known about the home, and the quality of the prediction algorithm.
Before a home is sold, the listing agent will usually update their listing on Zillow, ensuring that it is accurate. Professional pictures will be posted and the home’s description will be written to maximize its appeal. Zillow has to deal with potentially many changes to the core data it has on a home at that point, along with noticing that the home has been listed for sale at a specific price. The result, in our experience, was a couple of weeks of wide swings in the zEstimate for our home.
get_zEstimate_at_listing looks through the history of the zEstimate to find the one that had been published just prior to listing.
def get_zEstimate_at_listing(homeValueChartData, listedAt, dateSold):
    """
    Get the history of zEstimates for this property and pick out the one that
    is just prior to the listing date (if it exists) or the sell date
    Parameters:
        homeValueChartData    array of x,y points corresponding to time and zEstimate
        listedAt              listing date (can be null)
        dateSold              sale date (should not be null)
    Returns an object with keys:
        zEstimate_at_listing    the zEstimate at the earliest of (listing date, dateSold) or None
        zEstimate_date          the date corresponding to the zEstimate_at_listing
    """
    first_listing_date = zEstimate_at_listing = zEstimate_date = None
    if isinstance(dateSold, datetime) & isinstance(listedAt, datetime):
        first_listing_date = min([dateSold, listedAt])
    elif isinstance(dateSold, datetime):
        first_listing_date = dateSold
    elif isinstance(listedAt, datetime):
        first_listing_date = listedAt
    if first_listing_date:
        for el in homeValueChartData:
            if el.get('name') == "This home":
                points = el.get('points')
                # points are in ascending time order
                # loop until we've exceeded the first_listing_date
                # keep the last values before we go past the first_listing_date
                for p in points:
                    x = datetime.fromtimestamp(p.get('x')/1000)
                    if x < first_listing_date:
                        zEstimate_date = x
                        zEstimate_at_listing = p.get('y')
                    else:
                        break
    return {
        "zEstimate_at_listing": zEstimate_at_listing,
        "zEstimate_date": zEstimate_date
    }
The results can then be fetched simply with:
df = scrape(searchUrl)
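Since the scrape is deliberately slow (those randomized sleeps), it is worth checkpointing the dataframe to disk before moving on to the analysis; the file name here is just an example:
df.to_pickle('sold_properties.pkl')   # preserves the datetime columns
# later, in the analysis notebook: df = pd.read_pickle('sold_properties.pkl')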
The scraped results end up in a Pandas dataframe that can be analyzed in a Jupyter notebook. I’ll cover that in part 2 and show how I went about finding a realtor.