Digging into COVID-19 Data

Michel Floyd
Apr 5, 2020

I recently came across the New York Times' excellent COVID-19 tracking page. As usual, the NYT data visualization team has done a tremendous job of making graphics that tell a clear and compelling story. Even better, they are sharing their data via a GitHub repository, making it easy for mere mortals like you and me to use it.

I was curious about some metrics that the NYT does not include on its tracking page, so I forked their repository and started looking at the data using Jupyter.

Case Fatality Rate

The first thing I was interested in was the case fatality rate. This is defined as number of deaths / number of cases. Unfortunately both of those numbers can be highly inaccurate. This example from Italy illustrates undercounting of COVID-19 deaths in the small town of Nembro near Milan. People can die of COVID-19 without ever being tested for it, or they can die for some other reason because they couldn’t get adequate treatment due to stress on the healthcare system. Problems on the testing side have been well publicized. People who are asymptomatic are rarely tested, which means the denominator is usually going to be too small. Even worse, case fatality should only be computed on closed cases (death or cure), but it takes time (and more testing) to determine that someone has gotten over the virus. Three of my friends fortunately fall into this category, yet none has ever been tested.

With all those caveats in mind, I looked at total deaths attributed to COVID-19 divided by reported cases. Here’s a map of this at the US county level for counties with at least 10 reported cases as of April 4, 2020.

We can see that the highest fatality rates, as high as 25%, are in rural counties in the northwest and southeast. These high rates are likely due to under-testing in those areas. The numbers are small as well: that big reddish area in northwestern Montana is Toole County, which has reported 12 cases and 3 deaths. Overall the US is fortunate to have a fairly low cfr (2.7%) compared to some other countries, particularly Italy (12.3%). The US figure is higher than South Korea's 1.7%, but South Korea has led the world in testing.

Top and bottom 5 counties by cfr
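
For readers who want to reproduce this kind of table, here is a minimal sketch of the calculation in pandas. It assumes cumulative counts laid out like the NYT `us-counties.csv` file (which also carries a `fips` column, omitted here). The function name `case_fatality` and the two "Example" rows are made up for illustration; only the Toole County figures come from the text above, and this is not necessarily how the original notebook does it.

```python
import pandas as pd

# Toy rows shaped like the NYT county file (cumulative cases and deaths).
# Only the Toole County row uses numbers quoted in this post; the
# "Example" rows are invented for illustration.
rows = [
    ("2020-04-04", "Toole", "Montana", 12, 3),
    ("2020-04-04", "Example A", "Nowhere", 200, 4),
    ("2020-04-04", "Example B", "Nowhere", 5, 1),
]
df = pd.DataFrame(rows, columns=["date", "county", "state", "cases", "deaths"])


def case_fatality(df, as_of, min_cases=10):
    """Return counties ranked by deaths / cases (in percent) on one date."""
    snapshot = df[df["date"] == as_of]
    # Drop counties with too few cases for the ratio to mean much.
    snapshot = snapshot[snapshot["cases"] >= min_cases].copy()
    snapshot["cfr_pct"] = 100 * snapshot["deaths"] / snapshot["cases"]
    return snapshot.sort_values("cfr_pct", ascending=False)


cfr = case_fatality(df, "2020-04-04")
print(cfr[["county", "state", "cases", "deaths", "cfr_pct"]])
```

With the real file, `cfr.head(5)` and `cfr.tail(5)` would give the top and bottom five counties, subject to all the caveats about undercounting discussed above.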

Population Mortality Rate

Here at least we have less uncertainty in the denominator. Using US Census estimates for 2018 county populations, we can graph the mortality rate, expressed as deaths per 1,000 residents:

Top and bottom 5 counties by pmr

What does a population mortality rate (pmr) of 0.62 represent? It means that for every 1,000 people in the county 0.62 died, or more simply about 6 people for every 10,000. How does that compare to other causes of mortality? The leading cause of death in the US is heart disease, which has a pmr of around 0.2 per year. That 0.62 in Toole County, Montana is 3 times as large and has happened in a matter of two weeks. It’s probably an outlier, but even the next several counties are in the 0.26 to 0.39 range, with larger populations and more significant numbers of cases.
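
The arithmetic behind these figures is simple enough to sketch. The helper name `deaths_per` and the population of 5,000 are hypothetical, chosen only to make the numbers round; they are not census data for any real county.

```python
def deaths_per(deaths, population, per=1000):
    """Deaths per `per` residents (per=1000 gives the pmr used here)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return per * deaths / population


# Hypothetical county: 3 deaths among 5,000 residents (not real census data).
pmr = deaths_per(3, 5000)                    # per 1,000 residents
per_10k = deaths_per(3, 5000, per=10_000)    # same rate per 10,000 residents
print(pmr, per_10k)
```

Changing `per` only rescales the same ratio, which is why a pmr of about 0.6 per 1,000 reads as roughly 6 people per 10,000.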

The Case Doubling Rate

A fascinating aspect of this pandemic has been watching people who had little or no interest in math in school suddenly become obsessed with exponential growth and logarithmic plots. I once saw noted futurist Ray Kurzweil give a talk where he explained that most people have a hard time thinking exponentially; they are conditioned to extrapolate only linearly. COVID-19 is truly an exceptional disease, with the number of cases doubling every few days.

The New York Times has used the case doubling rate to show where cases are rising fastest (second map on that page). They also cleverly use spark charts (just below that second map) to highlight trends in the case doubling rate, albeit at the state level with a drill-down to county.

I wanted to look at the country as a whole to see where the case doubling rate (cdr), meaning the number of days it takes for cases to double, is getting longer (good) or shorter (bad). The following chart compares the cdr for the past 7 days to that for the preceding 7 days. That means we’re only looking at the past two weeks of behavior. The chart only includes counties with at least 10 total cases and at least 2 weeks of data.

As of April 4, 2020

This chart allows us to see where COVID-19 is accelerating. Southern Louisiana stands out, as does the upper Midwest, but there are many counties along the eastern seaboard where things are also getting worse, faster.

At the state level most states thankfully show a growing cdr. Nebraska has only 327 cases, but something is going on there. Things are already bad in Louisiana, with over 12,000 cases, and are likely getting worse. Will Louisiana look like New York in a couple of weeks?

Same analysis but at the state level

The map above reminds us of how misleading national area maps can be. New York City is only a few pixels on that map yet currently represents about a fifth of all US cases. The US national numbers won’t improve until NYC does. Fortunately the data tells us that the cdr in NYC is getting longer. For the past 7 days it stands at 6.61 days, compared to 2.95 days for the preceding 7 days.

One can only hope that we’ll get to the South Korean stage in the not too distant future where the case doubling rate is now measured in months rather than days.
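
The seven-day comparison used throughout this section can be sketched as follows. This assumes steady exponential growth within each window, so the doubling time is `days * ln(2) / ln(end / start)`. The function names and the case series are made up for illustration, and the notebook behind the charts may compute the cdr differently.

```python
import math


def doubling_time(cases_start, cases_end, days):
    """Days needed for cases to double, assuming steady exponential growth."""
    if cases_start <= 0 or cases_end <= cases_start:
        return math.inf  # no growth observed, so no finite doubling time
    return days * math.log(2) / math.log(cases_end / cases_start)


def compare_windows(cumulative, window=7):
    """Return (previous, recent) doubling times from daily cumulative counts.

    `cumulative` is ordered oldest first and must span two full windows.
    """
    if len(cumulative) < 2 * window + 1:
        raise ValueError("need at least two full windows of data")
    latest = cumulative[-1]
    one_back = cumulative[-1 - window]
    two_back = cumulative[-1 - 2 * window]
    previous = doubling_time(two_back, one_back, window)
    recent = doubling_time(one_back, latest, window)
    return previous, recent


# Made-up series: cases quadruple in the first week (100 to 400),
# then only double in the second week (400 to 800).
series = [100 * 4 ** (d / 7) for d in range(8)]
series += [400 * 2 ** (d / 7) for d in range(1, 8)]

previous, recent = compare_windows(series)
print(f"previous 7 days: {previous:.2f} days, past 7 days: {recent:.2f} days")
```

A `recent` value larger than `previous`, as in this toy series, is the "getting longer (good)" case; the reverse means growth is accelerating.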

The Data and the Code

… that was used to make these charts is publicly available on GitHub. If you want to take this further or learn how to use this data please clone or fork this repository.

Michel Floyd

@michelfloyd Founder cloak.ly, Tahoe resident. Cyclist, skier, sailor, photographer, soccer fan. MIT grad. Hertz Fellow