Covid-19 Data Analysis March 23rd

The novel coronavirus is bringing societies worldwide to a halt and causing widespread health impacts. Still, there appears to be a paucity of information how the crisis is likely to unfold in the near future. Most news outlets only show aggreate numbers, color-coded world maps, or descriptive statistics of the current situation.

This workbook is part of a daily series, in which I am taking a closer look at numbers and trends, and making some tentative projections. All data are taken from the Johns Hopkins Covid-19 dataset on GitHub. Each day, Johns Hopkins update the dataset with figures for the previous day. Numbers are pulled live when running the workbook.

All projections are based on curve fitting on a per-country level. Thus, they extrapolate current trends. By nature, they can neither reflect the impact of recently enacted measures, which are expected to reduce infection rates, nor can they reflect new adverse developments. So please take everything with a grain of salt.

Please stay safe, look after your loved ones, and stay healthy.

In [1]:
# Hide raw notebook code in exported HTML
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show) {
   $('div.input').hide();
 } else {
   $('div.input').show();
 }
 code_show = !code_show
} 
$(document).ready(code_toggle);
</script>
<form"><input type="button" value="Show/hide code" onclick="code_toggle()"></input></form>
''')
Out[1]:

Global Overview

As of March 22st end of day, there were 336k confirmed cases of Covid-19 globally, and this figure continues to grow at 10% per day, down from 12% the previous day.

15k people have died from this illness. At 12.6% per day, down from 15%, deaths continue to grow growing faster than new cases

From the data, recoveries appear to grow more slowly, at 7% daily (up from 5% yesterday). This may be due to the fact that recoveries take longer on a per-case basis; it may also be that many countries track new cases and deaths by legal obligation, but are not tracking recoveries as rigorously; particularly for mild cases.

In [2]:
# Import required libraries
import pandas as pd
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

import numpy  as np
from scipy.optimize import curve_fit
from scipy.special import expit

import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
import matplotlib.dates as mdates

import datetime as datetime
import operator
import sys

# Load data and group by country/region
def loadAndGroup(fileName, groupBy="Country/Region"):
    df=pd.read_csv(fileName)
    df=df.groupby(groupBy).sum()
    return df
    
# Load data
confd =loadAndGroup('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv')
recovd=loadAndGroup('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv')
deaths=loadAndGroup('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv')
population=loadAndGroup('./population.csv','Country')
today =confd.columns[-1]

# Calculate new, previous and total as well as growth rate from time series
def newPrevTotalGrowth(df):
    df1        =df.loc[:, df.columns[-2]:]
    prev, total=df1.sum(axis=0)
    delta      =total-prev
    return delta, prev, total, delta/prev if prev!=0 else 0

# Calculate overall data sets for overview plots
confdNew,  confdPrev,  confdTotal,  confdGrowth =newPrevTotalGrowth(confd)
recovdNew, recovdPrev, recovdTotal, recovdGrowth=newPrevTotalGrowth(recovd)
deathsNew, deathsPrev, deathsTotal, deathsGrowth=newPrevTotalGrowth(deaths)

closedNew, closedPrev, closedTotal=recovdNew+deathsNew, recovdPrev+deathsPrev, recovdTotal+deathsTotal
closedGrowth=closedNew/closedPrev if closedPrev!=0 else 0
activeNew, activePrev, activeTotal=confdNew-closedNew, confdPrev-closedPrev, confdTotal-closedTotal
activeGrowth=activeNew/activePrev if activePrev!=0 else 0

labels=['Confirmed', 'Active', 'Closed', 'Recovered', 'Deaths']
valuesTotal=[confdTotal, activeTotal, closedTotal, recovdTotal, deathsTotal]
valuesNew=[confdNew, activeNew, closedNew, recovdNew, deathsNew]
valuesPrev=[confdPrev, activePrev, closedPrev, recovdPrev, deathsPrev]
valuesPerc=[1, activeTotal/confdTotal, closedTotal/confdTotal, recovdTotal/confdTotal, deathsTotal/confdTotal]
valuesGrowth=[confdGrowth, activeGrowth, closedGrowth, recovdGrowth, deathsGrowth]

# Prepare overview plot
plt.rcParams['figure.figsize'] = [15, 5]
fig, ax=plt.subplots(nrows=1, ncols=3, constrained_layout=True)
fig.suptitle('Covid-19 Global Overview as of %s' % today, fontsize="16")

# Human readable formatting for figures ranging in single digits, thousands, millions and billions
def humanReadable(x):
    if x<0:
        return "-"+humanReadable(-x)
    formats=[ (10000000000, 1000000000,"%.0fb"), (1000000000,1000000000, "%.1fb"), (10000000, 1000000, "%.0fm"), 
             (1000000, 1000000, "%.1fm"), (10000,1000,"%.0fk"), (1000,1000,"%.1fk") ]
    for threshold, divider, formatString in formats:
        if x>=threshold:
            return formatString % (x/divider)
    return "%d" % x

@plticker.FuncFormatter
def hrFormatter(x, pos):
    return humanReadable(x)

# Left side: case counts
ax[0].set_title('Absolute case counts')
ax[0].get_yaxis().set_visible(False)
ax2 = ax[0].twinx()
ax2.invert_yaxis()
ax2.invert_xaxis()
ax2.xaxis.set_major_formatter(hrFormatter)
ax2.barh(labels, valuesTotal)
for i, v in enumerate(valuesTotal):
    ax2.text(v, i, humanReadable(v), ha='right', va='center', color='k', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'), backgroundcolor='w')

# Middle: Growth rates
ax[1].set_title('Daily growth rates')
ax[1].invert_yaxis()
ax[1].get_yaxis().set_visible(False)
ax[1].xaxis.set_major_formatter(plticker.PercentFormatter(1.0))
ax[1].barh(labels, valuesGrowth)
for i, v in enumerate(zip(valuesGrowth, valuesNew)):
    ax[1].text(v[0]/2, i, "+"+humanReadable(v[1]), ha='center', va='center', color='w')
for i, v in enumerate(valuesGrowth):
    ax[1].text(v, i, " %.1f%%" % (v*100), ha='left', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'), backgroundcolor='w')
    
# Middle: Percentages
ax[2].set_title('Share of confirmed cases')
ax[2].invert_yaxis()
ax[2].get_yaxis().set_visible(False)
ax[2].xaxis.set_major_formatter(plticker.PercentFormatter(1.0))
ax[2].barh(labels, valuesPerc)
for i, v in enumerate(valuesPerc):
    ax[2].text(v, i, " %.1f%%" % (v*100), ha='left', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'), backgroundcolor='w')
        

Global Time Series

On a global level, exponential growth appears to continue unabated. The first exponential growth wave in China plateaued in mid-February. Since then, the global spread of the disease has launched a second exponential growth wave of significantly higher magnitude.

In [3]:
# Aggregate time series data on global level
globalConfd=confd.loc[:, confd.columns[2]:].sum(axis=0)
globalRecovd=recovd.loc[:, confd.columns[2]:].sum(axis=0)
globalDeaths=deaths.loc[:, confd.columns[2]:].sum(axis=0)
globalClosed=globalRecovd+globalDeaths
globalActive=globalConfd-globalClosed
globalDF=pd.DataFrame({
    'Deaths':globalDeaths, 
    'Active':globalActive,
    'Recovered':globalRecovd
})

# Prepare stacked area plot
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(1, 1, 1)
colors=['r', 'orange', 'g']
globalDF.plot(ax=ax, kind='area', stacked=True, color=colors)
ax.set_title("Global Cases over Time, as of %s" % today, fontsize=16)
ax.set_ylabel("Number of cases")
ax.yaxis.set_major_formatter(hrFormatter)
ax.get_legend().remove()

# Plot and label the values
xPos=len(globalConfd.index)-1
x=globalConfd.index[xPos]
c, r, d, a, cl=globalConfd[x], globalRecovd[x], globalDeaths[x], globalActive[x], globalClosed[x]
ax.text(xPos, d/2, " %s deaths" %  humanReadable(d), ha='left', va='center', c='r')
ax.text(xPos, d+a/2, " %s active" %  humanReadable(a), ha='left', va='center', c='orange')
ax.text(xPos, d+a+r/2, " %s recovered" %  humanReadable(r), ha='left', va='center', c='g')
ax.text(xPos, c, " %s confirmed" % humanReadable(c), ha='left', va='center', c='k')
plt.show()

There might be one ray of light: absolute daily case growth on March 20th-22nd appears to be plateauing. This may be an early indication that social distancing measures are beginning to work. It might also be due to measurement noise, e.g. incomplete data capture due to the weekend. Prior dates like March 12th and 13th have also seen dips in growth rate followed by corrections the subsequent day.

In [4]:
# Calculate daily deltas
globalConfdDelta=globalConfd.diff()
globalRecovdDelta=-globalRecovd.diff()
globalDeathsDelta=-globalDeaths.diff()
globalDFDelta=pd.DataFrame({
    'DeathsDelta':globalDeathsDelta, 
    'RecoveredDelta':globalRecovdDelta,
    'ConfirmedDelta':globalConfdDelta
})

# Prepare stacked area plot
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(1, 1, 1)
colors=['r', 'g', 'orange' ]
globalDFDelta.plot(ax=ax, kind='area', stacked=True, color=colors)
ax.set_title("Daily change in global cases over Time, as of %s" % today, fontsize=16)
ax.set_ylabel("Growth in number of cases")
ax.yaxis.set_major_formatter(hrFormatter)
ax.get_legend().remove()

# Plot and label the values
xPos=len(globalConfd.index)-1
x=globalConfd.index[xPos]
c, r, d=globalConfdDelta[x], globalRecovdDelta[x], globalDeathsDelta[x]
ax.text(xPos, d, " %s new deaths" % humanReadable(d), ha='left', va='center', c='r')
ax.text(xPos, r+d, " %s new recovered" % humanReadable(r), ha='left', va='center', c='g')
ax.text(xPos, c, " %s new confirmed" % humanReadable(c), ha='left', va='center', c='orange')
plt.show()

This seems to be supported by percentage daily case growth trends for two days in a row. Fingers crossed that this pattern holds.

In [5]:
# Calculate daily deltas
globalConfdDeltaPerc=globalConfdDelta/globalConfd.shift(periods=1)
globalRecovdDeltaPerc=-globalRecovdDelta/globalRecovd.shift(periods=1)
globalDeathsDeltaPerc=-globalDeathsDelta/globalDeaths.shift(periods=1)
globalDFDeltaPerc=pd.DataFrame({
    'DeathsDeltaPerc':globalDeathsDeltaPerc, 
    'RecoveredDeltaPerc':globalRecovdDeltaPerc,
    'ConfirmedDeltaPerc':globalConfdDeltaPerc
})

# Prepare stacked area plot
fig = plt.figure(figsize=(15, 8))
ax = fig.add_subplot(1, 1, 1)
colors=['r', 'g', 'orange' ]
globalDFDeltaPerc.plot(ax=ax, kind='line', stacked=False, color=colors)
ax.set_title("Daily Percentage Change in Global Cases over Time, as of %s" % today, fontsize=16)
ax.yaxis.set_major_formatter(plticker.PercentFormatter(1.0))
ax.get_legend().remove()

# Plot and label the values
xPos=len(globalConfd.index)-1
x=globalConfd.index[xPos]
c, r, d=globalConfdDeltaPerc[x], globalRecovdDeltaPerc[x], globalDeathsDeltaPerc[x]
ax.text(xPos, d, " %.1f%% new deaths" % (d*100), ha='left', va='center', c='r')
ax.text(xPos, r, " %.1f%% new recovered" % (r*100), ha='left', va='center', c='g')
ax.text(xPos, c, " %.1f%% new confirmed" % (c*100), ha='left', va='center', c='orange')
plt.show()

Country Overview

China continues to have the most cases, however, case growth in China appears no longer material. The number of cases in Italy, the US, Spain and Germany exceeds Iran. Case numbers in the US continues to rise rapidly, with the country continuing to rank 3rd most impacted worldwide.

Among larger western countries, the US continues to stand out with a 30.5% growth rate (down from 33%). Growth rates in Germany have declined to 12% from 30% two days ago. If this is not entirely due to incomplete reporting over the weekend, it may may be an early indication that heavier social distancing measures including school and university closures are beginning to have impact.

In the same group, Italy and Switzerland continue to have the highest number of cases per capita, a proxy for the highest infection risk from social interaction. Spain has the next highest case incidence per capita.

Case fatality rates (CFR), measured as number of deaths per number of confirmed cases, remain highest in Italy with 9% (slightly up) and Iran with 8%, possibly due to overwhelmed health systems. Spain stands at 6%. China, France, the UK, the Netherlands and Japan are in the 3.5-5% range. CFR in Germany remains at 0.4%, up from 0.2% two days ago.

In [6]:
def lastGrowth(data):
    if len(data)>=2 and data[-2]>0:
        return data[-1]/data[-2]-1
    return 0    

# Calculate summary stats for all countries with number of cases above the threshold
threshold=100
validCountries=[]
for name in confd.index.values:
    conf1=confd.loc[name, confd.columns[2]:] # select row for given country, filter out lat/lon columns
    confv=conf1.to_numpy()
    if confv[-1]>=threshold:
        recovd1=recovd.loc[name, confd.columns[2]:]
        recovdv=recovd1.to_numpy()

        death1=deaths.loc[name, confd.columns[2]:]
        deathv=death1.to_numpy()

        summary={
            'Name':name,
            'Cases':confv[-1],
            'CaseGrowth':lastGrowth(confv),
            'Recovered':recovdv[-1],
            'RecoveredGrowth':lastGrowth(recovdv),
            'Deaths':deathv[-1],
            'DeathsGrowth':lastGrowth(deathv),
            'CFRTotal':deathv[-1]/confv[-1],
            'CFROutcome':deathv[-1]/(deathv[-1] + recovdv[-1]) if deathv[-1]+recovdv[-1]>0 else 0,
        }
        validCountries.append(summary)

# Prepare sorted stats        
validCountries.sort(key=lambda x: x['Cases'], reverse=True)

# Prepare data for plots
countryNames=[x['Name'] for x in validCountries]
countryCases=[x['Cases'] for x in validCountries]
countryGrowth=[x['CaseGrowth'] for x in validCountries]
countryCFRTotal=[x['CFRTotal'] for x in validCountries]
countryCFROutcome=[x['CFROutcome'] for x in validCountries]
countryPop1000=[population.loc[x]['PopulationInThousands'] for x in countryNames]
countryCasesPerPop1000=[cases/pop if pop!=0 else 0 for cases, pop in zip(countryCases, countryPop1000)]

# Prepare overview plot
plt.rcParams['figure.figsize'] = [15, 20]
fig, ax=plt.subplots(nrows=1, ncols=4, constrained_layout=True)
fig.suptitle('Covid-19 Country Overview as of %s' % today, fontsize="16")

# Left hand side: Plot lastest confirmed cases by country
ax[0].set_xscale('log')
ax[0].set_title('Confirmed cases (log scale)')
ax[0].get_yaxis().set_visible(False)
ax2 = ax[0].twinx()
ax2.invert_yaxis()
ax2.invert_xaxis()
ax2.xaxis.set_major_formatter(hrFormatter)
ax2.barh(countryNames, countryCases)
for i, v in enumerate(countryCases):
    ax2.text(v, i, "%s " % humanReadable(v), ha='right', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'))

# Middle left: Plot latest growth rate by country
ax[1].set_title('Daily case growth rate')
ax[1].invert_yaxis()
ax[1].get_yaxis().set_visible(False)
ax[1].xaxis.set_major_formatter(plticker.PercentFormatter(1.0))
ax[1].barh(countryNames, countryGrowth)
for i, v in enumerate(countryGrowth):
    ax[1].text(v, i, " %.1f%%" % (v*100), ha='left', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'))

# Middle right: Plot cases per 1000 population
ax[2].set_title('Cases per 100k inhabitants')
ax[2].invert_yaxis()
ax[2].get_yaxis().set_visible(False)
ax[2].xaxis.set_major_formatter(plticker.PercentFormatter(1.0))
ax[2].barh(countryNames, countryCasesPerPop1000)
for i, v in enumerate(countryCasesPerPop1000):
    ax[2].text(v, i, " %.1f" % (v*100), ha='left', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'))

line =ax[1].axvline(x=confdGrowth, ymin=0, ymax=len(countryNames), ls="--")
ax[1].text(confdGrowth, -1, " \u2300 %.1f%%" % (confdGrowth*100), ha='left', va='center', color=line.get_color())

# Right: Plot CFR by country
ax[3].set_title('Case fatality rate')
ax[3].invert_yaxis()
ax[3].get_yaxis().set_visible(False)
ax[3].xaxis.set_major_formatter(plticker.PercentFormatter(1.0))
ax[3].axvline(x=0, ymin=0, ymax=len(countryNames), color='k', ls='-', lw='.8')
ax[3].barh(countryNames, countryCFRTotal)
for i, v in enumerate(countryCFRTotal):
    if v!=0:
        ax[3].text(v, i, " %.1f%% " % (v*100), ha='left', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'))

plt.show()

Case fatality rate (CFR) is calculated as deaths/total confirmed cases. This is likely to be an understatement of the true value, as the outcome of currently active cases is still open. Alternative CFR measures like deaths/(deaths+recovered) lead to outlier results for many countries, as recovery appears to take longer and data capture about recovery tends to be less strict than about deaths. I have therefore removed that indicator.

Growth of Confirmed Cases

The chart shows the growth of confirmed cases by country. To developments more comparable across countries, each country curve is shifted in time so that day 0 corresponds to the moment when that country exceeds a threshold number of confirmed cases (here=100).

From reported numbers, containment and mitigation appear to be most effective in China, South Korea, Singapore and Japan. Iran appears to be flattening. Norway and Sweden also appear to be growing less rapidly. Most major western countries still continue to follow the frightening trajectory of Italy.

In [7]:
# Retrieve data for one country from the data frame
def getData(df, name):
    df1   =df.loc[name, df.columns[2]:]
    days  =df1.index.values
    days  =[datetime.datetime.strptime(d,"%m/%d/%y").date() for d in days]
    daynr =range(1, len(days)+1)
    values=df1.to_numpy()
    return days, daynr, values

# Crop away starting values < n from the arrays
def startAtN(values, n):
    while(len(values)>0 and values[0]<n):
        values=values[1:]
    return values

fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1)

threshold=100
ax.set_title("Confirmed cases by country, as of %s" % today, fontsize=16)
ax.set_xlabel("Days since country reached %s cases" % humanReadable(threshold))
ax.set_ylabel("Confirmed cases (log scale)")
ax.set_yscale('log')
ax.yaxis.set_major_formatter(hrFormatter)

# Plot individual country curves
maxLen=0
for i, cty in enumerate(validCountries, start=0):
    name=cty['Name']
    days, daynr, values=getData(confd, name)
    shiftedValues=startAtN(values, threshold)
    if(len(shiftedValues)>maxLen):
        maxLen=len(shiftedValues)
    line, =ax.plot(range(0, len(shiftedValues)), shiftedValues)
    c=line.get_color()
    ax.text(len(shiftedValues)-1, shiftedValues[-1], name, color=c)

    
# Plot an exponential growth line with given start value and factor
def plotFactor(ax, start,factor,length):
    xs=range(0, length)
    ys=[start]
    while(len(ys)<length):
        ys.append(factor*ys[-1])

    line, =ax.plot(xs, ys,"--k")
    c=line.get_color()
    ax.text(xs[-1], ys[-1], "%.1f%% daily growth" % (factor*100-100), color=c)

# Plot the average growth line
plotFactor(ax,threshold, 1+confdGrowth, maxLen//2)
   
plt.show()

Projection of Most Impacted Countries

To project current trends, we try and fit both an exponential function, which grows without bound, and a sigmoid function, which flattens out again. We then select the curve with the lowest mean square error. Such projections should be taken with a grain of salt. The difference between exponential and sigmoid growth can be substantial over a few week's time. So slight differences in measurement noise may lead to significantly different outcomes.

This estimator has switched the US from exponential to sigmoid growth today, resulting in large reductions for outcomes in future weeks. It remains too early to say whether this pattern holds. I have added confidence ranges to the detailed country charts in the next section to reflect this.

If ongoing mitigations do not materially change the course of the disease, one week from now, the US would become the most impacted country, with upwards of 115k cases (down from 270k), followed by Italy with 95k (up from 90k), and then China which continues to remain largely flat around 80k. Two weeks from now, the US and Italy would continue to be #1 and #2 most impacted, but with more subdued case growth than in yesterday's prediction. The UK and Switzerland, yesterday projected to overtake Italy, would remain middle of the pack.

Fingers crossed that social distancing is going to further reduce infection rates strongly.

In [39]:
# Exponential model for fitting
def fitExp(x, a, b):
    return a*np.exp(b*x)

# Sigmoid model for fitting
def fitSig(x, a, b, c):
    return a/(1.0+np.exp(-b*x - c))

def s3(x):
    return np.around(x,decimals=3)

def cropDuplicateLeadingZeros(daynr, values):
    while(len(daynr)>1 and values[1]<=0):
        daynr, values=daynr[1:], values[1:]
    return daynr, values

def extendTime(days, daynr, extraDays):
    for i in range(0, extraDays):
        days=np.append(days,days[-1]+datetime.timedelta(days=1))
        daynr=np.append(daynr, daynr[-1]+1)
    return days, daynr

# Fit given curve to the given data
def fitCurve(cropDaynr, cropValues, fitFunc, p0, eqFormatter):
    try:
        # fit curve
        popt, pcov=curve_fit(fitFunc, cropDaynr, cropValues, p0)
    
        # estimate error
        cropProj=fitFunc(cropDaynr, *popt)   # estimate error
        sqdiff=np.sum((cropValues-cropProj)**2)
        
        # generate label for chart
        equation=eqFormatter(popt) 
        if(len(cropProj)>=2 and cropProj[len(cropProj)-2]!=0):
            growthRate=cropProj[len(cropProj)-1]/cropProj[len(cropProj)-2]-1
        else:
            growthRate=0
        fitLabel="%s\n%.2fx daily growth" % (equation, growthRate)
        return popt, pcov, sqdiff, fitLabel
            
    except (RuntimeError, TypeError) as e:
        return [], [], sys.float_info.max, ""


# Fit multiple functions to the curve, return the fit with the lowest square error
def fitCurves(daynr, values):        
    bestFitFunc, bestPopt, bestPcov, bestSqdiff, bestFitLabel=[], [], [], sys.float_info.max, ""

    if values[-1]>1:
        # Fit on reduced dataset without leading zeros
        cropDaynr, cropValues=cropDuplicateLeadingZeros(daynr, values)
        fittings=[
            [ fitExp, [0.1, 0.1],            lambda popt: "f(x)=%g e^(%g x)" % (s3(popt[0]), s3(popt[1])) ],
            [ fitSig, [values[-1], 0.1, -10], lambda popt: "f(x)=%g /(1- e^(\n- %g x %+g))" % (s3(popt[0]), s3(popt[1]), s3(popt[2])) ]
        ]

        # Perform and evaluate curve fits
        for fitFunc, p0, eqFormatter in fittings:
            popt, pcov, sqdiff, fitLabel=fitCurve(cropDaynr, cropValues, fitFunc, p0, eqFormatter)
            if sqdiff<bestSqdiff:
                bestFitFunc, bestPopt, bestPcov, bestSqdiff, bestFitLabel=fitFunc, popt, pcov, sqdiff, fitLabel

    return bestFitFunc, bestPopt, bestPcov, bestSqdiff, bestFitLabel


# Fit curve for the top K valid countries 
topK=20
extDayCount=21
confdFits, recovdFits, deathsFits={}, {}, {}
confdProj7, confdProj14, confdProj21=[], [], []
np.seterr(over='ignore')
for i, cty in enumerate(validCountries[0:topK], start=0):
    name=cty['Name']

    days, daynr, values=getData(confd, name)
    confdFits[name]=fitCurves(daynr, values)
    fitFunc, fitPopt, fitPcov, fitSqdiff, fitLabel=confdFits[name]
    extDays, extDaynr=extendTime(days, np.array(daynr), extDayCount)
    proj=fitFunc(extDaynr, *fitPopt)
    confdProj7.append(proj[len(daynr)-1+7])
    confdProj14.append(proj[len(daynr)-1+14])
    confdProj21.append(proj[len(daynr)-1+21])
           
    days, daynr, values=getData(recovd, name)
    recovdFits[name]=fitCurves(daynr, values)
    days, daynr, values=getData(deaths, name)
    deathsFits[name]=fitCurves(daynr, values)

    
# Plot overview of projections
plt.rcParams['figure.figsize'] = [15, 8]
fig, ax=plt.subplots(nrows=1, ncols=4, constrained_layout=True)
fig.suptitle('Covid-19 Projections by Country as of %s (log scale)' % today, fontsize="16")

# Leftmost chart: Plot lastest confirmed cases by country
ax[0].set_title('Today')
ax[0].set_xscale('log')
ax[0].xaxis.set_major_formatter(hrFormatter)
ax[0].get_yaxis().set_visible(False)
ax2 = ax[0].twinx()
ax2.invert_yaxis()
ax2.invert_xaxis()
ax2.barh(countryNames[:topK], countryCases[:topK])
for i, v in enumerate(countryCases[:topK]):
    ax2.text(v, i, humanReadable(v), ha='right', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'))

# Subsequent charts: plot projections for some weeks out
for i, title, proj in [(1, "In one week", confdProj7), (2, "In two weeks", confdProj14), (3, "In three weeks", confdProj21)]:
    ax[i].set_title(title)
    ax[i].set_xscale('log')
    ax[i].xaxis.set_major_formatter(hrFormatter)
    ax[i].invert_yaxis()
    ax[i].get_yaxis().set_visible(False)
    ax[i].barh(countryNames[:topK], proj)
    for j, v in enumerate(proj):
        ax[i].text(v, j, humanReadable(v), ha='left', va='center', bbox=dict(boxstyle='square,pad=0', fc='w', ec='none'))

# Ensure all charts are drawn to the same scale
low, high=ax[-1].get_xlim()
ax2.set_xlim(high,low)
for i in range(1,3):
    ax[i].set_xlim(low, high)
        

Country Projection Details

Three charts per country: confirmed cases, recovered cases and deaths. Black dots show the actual values from Hopkins. The smooth lines have been fitted to the data, using the same curve fits as above. The dotted lines are two standard deviation bounds to the fitted data. The usual disclaimers about curve fits apply.

I have added a confidence range of +/- two standard deviations for each fitted curve parameter to the charts to reflect the uncertainty of curve fits. For the mathematically inclined, I'm using the square root of the diagonal of the covariance matrix to estimate standard deviations for each parameter individually; I am cavalierly neglecting the other parts of the covariance matrix which capture interdepdencies between the parameters. There clearly are limits to this two sigma approach: e.g., for some countries, the two sigma lower bound of the one week projection ends up lower than the already confirmed number of cases. So as always with projections, please take them with a grain of salt.

Net net, some country estimates have particularily high uncertainty right now, especially the US.

Some of the reported data remain suspicious, e.g. the low number of recovered cases in the US, France, Switzerland and the Netherlands. They may be due to delayed recoveries in the early stages of an epidemic, to a lack of data capture about recoveries, or other factors. The wide error bounds for the recoveries, or sometimes the lack of fitted curve at all, clearly show the limits of curve fitting.

The curve for Korea also appears to depart from both exponential and sigmoid growth functions. Happy to have your suggestions or contributions for further curve families to try and fit.

In [40]:
# Plot the raw data and the fitted curve on the given axes
def plotRawAndCurve(days, daynr, values, fitFunc, fitPopt, fitPcov, fitSqdiff, fitLabel, ax, label, name, fitDrawingStyle, extDayCount, yScale='linear'):
    # Plot raw data and label last data point 
    ax.set_yscale(yScale)
    ax.plot(days, values, 'ko', label="%s in %s" % (label, name))
    ax.text(days[-1], values[-1], "%s " % humanReadable(values[-1]), color='k', horizontalalignment="right", verticalalignment="bottom")

    # Plot the best fit projection, if it exists
    if fitSqdiff<sys.float_info.max:
        extDays, extDaynr=extendTime(days, np.array(daynr), extDayCount)
        
        # Calculate two standard deviations for each parameter individually
        # For simplicity, we are taking the diagonal of the covariance matrix,
        # i.e. the variances of each parameter with itself. The square root
        # of that is one standard deviation. 
        pcov2Sigma=[2*np.sqrt(fitPcov[i][i]) for i in range(len(fitPopt))]

        # Plot projected scenario 
        extProj=fitFunc(extDaynr, *fitPopt)
        line, =ax.plot(extDays, extProj,  fitDrawingStyle, label=fitLabel)

        # Plot low and high charts first
        poptLow=fitPopt-pcov2Sigma
        extProjLow=fitFunc(extDaynr, *poptLow)
        line, =ax.plot(extDays, extProjLow,  "%s-" % fitDrawingStyle, label="2\u03A3 low")
        ax.text(extDays[-1], extProjLow[-1], "  %s" % humanReadable(extProjLow[-1]), color=line.get_color(), backgroundcolor='w', horizontalalignment="left", verticalalignment="center" )

        poptHigh=fitPopt+pcov2Sigma
        extProjHigh=fitFunc(extDaynr, *poptHigh)
        line, =ax.plot(extDays, extProjHigh,  "%s-" % fitDrawingStyle, label="2\u03A3 high")
        ax.text(extDays[-1], extProjHigh[-1], "  %s" % humanReadable(extProjHigh[-1]), color=line.get_color(), backgroundcolor='w', horizontalalignment="left", verticalalignment="center" )

        # Label projected scenario last, so it appears on top
        ax.text(extDays[-1], extProj[-1], "  %s" % humanReadable(extProj[-1]), color=line.get_color(), backgroundcolor='w', horizontalalignment="left", verticalalignment="center" )

        
    # Finish up the formatting
    locator = mdates.AutoDateLocator(minticks=5, maxticks=10)
    formatter = mdates.ConciseDateFormatter(locator)
    ax.xaxis.set_major_locator(locator)
    ax.xaxis.set_major_formatter(formatter)
    ax.yaxis.set_major_formatter(hrFormatter)
    ax.legend()
    
# Plot fits for the top K valid countries and given number of extension days
extDayCount=7
for i, cty in enumerate(validCountries[0:topK], start=0):
    # Retrieve data and fit
    name=cty['Name']
    
    # Prepare plot area
    plt.rcParams['figure.figsize'] = [16, 6]
    fig, axs=plt.subplots(nrows=1, ncols=3, constrained_layout=True)
    plt.suptitle('%s: Confirmed cases, recovered and deaths on %s with %d day prediction' % (name, today, extDayCount))

    # Select and plot data for this country
    chartParams=[(confd, confdFits, axs[0], "Confirmed", "b-"), (recovd, recovdFits, axs[1], "Recovered", "g-"), (deaths, deathsFits, axs[2], "Deaths", "r-")]
    for df, fits, ax, legend, fitDrawingStyle in chartParams:
        days, daynr, values=getData(df, name)
        fitFunc, fitPopt, fitPcov, fitSqdiff, fitLabel=fits[name]
        plotRawAndCurve(days, daynr, values, fitFunc, fitPopt, fitPcov, fitSqdiff, fitLabel, ax, legend, name, fitDrawingStyle, extDayCount, yScale='linear')
    plt.show()

Return to daily series overview.

For questions and comments, please reach out to me on LinkedIn or Twitter.