Approximating Coronavirus Cases in 2019

8 min readOct 5, 2021

JHU is as of 2019 using a GitHub repo to store its data instead of a web API, so the dataset we are using is in 2019. Other than scraping methodology, the subject context is the 2019-nCoV which is a contagious coronavirus that hailed from Wuhan, China. This new strain of virus has struck fear in many countries as cities are quarantined and hospitals are overcrowded. To help us understand how 2019-nCoV is spread around the world, let’s plot correlation across number of days. Special thanks to Johns Hopkins University for open-sourcing their dataset.

All Countries/Regions, #Confirmed Correlation with Singapore

Our extraordinary .csv files, of which there are nine, include

2019_nCoV_20200121_20200205.csv

which is the dataset for this tutorial. In this tutorial, we are testing which provinces correlate with #Confirmed by Last Update, using 2019_nCoV_20200121_20200205.csv from kaggle.com/brendaso/2019-coronavirus-dataset-… and the setup is Python3 in zsh, with which we seek to obtain weights on approximation.csv by invoking the following command: ~/python3 approximate_confirmed.py 3 Singapore 2019_nCoV_20200121_20200205.csv > approximation.csv.

Country/Region Weight
0 South Korea 1.143
1 Hong Kong 0.269
2 Philippines -0.950

We begin writing in approximate_confirmed.py, and the arguments supplied to our command include the number of countries/regions to include in our model, the country whose confirmed cases we are approximating, and our dataset. Because this command writes weights to approximation.csv, we might also invoke these commands which allow us, instead of printing to approximation.csv, to print on the zsh terminal; without the > approximation.csv at the end of the command of invocation, the text instead of going to .csv goes to the terminal.

python3 approximate_confirmed.py 3 Mainland\ China 2019_nCoV_20200121_20200205.csv
python3 approximate_confirmed.py 6 Malaysia 2019_nCoV_20200121_20200205.csv
python3 approximate_confirmed.py 6 Finland 2019_nCoV_20200121_20200205.csv
python3 approximate_confirmed.py 6 Spain 2019_nCoV_20200121_20200205.csv

On our dataset it’s important to recall the Last Update column, for which we are using the date 01/01/2019 as the beginning, and the Country/Region column, which contains 32 unique countries across 35 days.

Let’s find different ways to estimate the approximate confirmed coronavirus cases in Country Y with few variables! Thank you and enjoy your stay. This requires importing the libraries numpy to edit data frames, math for math functions, matplotlib.pyplot to visualize and to calculate Pearson correlation coefficient and p-value of the model, stats from scipy, pandas to read .csv and mutate data frames, and LinearRegression from sklearn.linear_model to create the multilinear regression model, and sys to access command line args. The datetime library allows us to convert dates (2020–01–01, …, 2020–02–06) into integers (0, …, 35), and the textwrap library allows us to keep large equations on the graph.

import numpy as np
import math
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
from sklearn.linear_model import LinearRegression
import sys
from datetime import date
from textwrap import wrap

Once we read in the data, turn it into a pandas dataframe, create a separate column for each Country/Region, define a function that takes in the current day string, such as 2020–01–01, and returns the distance in days to 2020–02–06, the starting point.

data = pd.read_csv(sys.argv[3])
df = pd.DataFrame(data, columns=['Country/Region', 'Last Update', 'Confirmed'])
newColumns = np.append(['Last Update'], data['Country/Region'].unique())def convertDay(day):
     arr2 = day.split(" ")[0].split("/")
     arr1 = "1/1/2020".split("/")
     if (arr2[2] == '20'):
         arr2[2] = '2020'
     arr1 = list(map(lambda x: int(x), arr1))
     arr2 = list(map(lambda x: int(x), arr2))
     first, second, third = arr1
     f_date = date(third, first, second)
     first, second, third = arr2
     l_date = date(third, first, second)
     delta = l_date - f_date
     return (delta.days)

Based on the original column Last Update, make a new Last Update column with integer representations of each day, and update the Last Update column,

newDates = list(map(lambda x: convertDay(x), data['Last Update']))
df['Last Update'] = newDates

create dfObj, an empty dataframe with the following column names: Mainland China, Singapore, Thailand, …, Colombia,

dfObj = pd.DataFrame(columns=newColumns)

we start populating dfObj. For all days 0, 1, …, 35, find all rows matching the current day, get all confirmed cases for that day, and if we had data for that day then add it to dfObj at that day.

for day in range(0, max(df['Last Update'])):
     uniqueCountry = df['Country/Region'].unique()
     aa = pd.DataFrame(columns=newColumns)
     aa = df[df["Last Update"] == day]
     aa = aa.groupby("Country/Region").sum()
     values = []
     for country in uniqueCountry:
          if (country in aa.iloc[:, 0].keys().tolist()):
               values.append(aa.iloc[:, 1][country])
          else:
               values.append(0)     if (len(values)):
          dfObj.loc[day] = [day, *values]

At this stage going from data to dfObj, see how far we’ve gone: the translation is successful! From the command line arg Y supplied by the user

indexY = list(dfObj.columns).index(sys.argv[2])

we get the Y column. Let’s get all columns. Each country/region is one column, showing # Confirmed: Accumulated number of confirmed 2019-nCoV cases from the start to end date. To give axis 0 with size len(uniqueCountry),

X = dfObj.iloc[:, 1:len(uniqueCountry)]X = X.drop([sys.argv[2]], axis=1)X = X.values.reshape(-1, len(X.columns))

(Numpy should calculate the dimension of rows, and assume there are len(X.columns) columns. One for each Country/Region). Get the Country Y column, which is Y.

Y = dfObj.iloc[:, indexY].values.reshape(-1, 1)

(Numpy should calculate the dimension of rows, with 1 column for Y). This is just a preliminary multilinear model with all 32 countries.

model = LinearRegression(fit_intercept=False).fit(X, Y)

From Y, remove the list encasing number and because we’re about to calculate the Pearson coefficient of correlation, make it comma separated.

Y = Y.flatten()Y = list(Y)modelCoefficients = list(model.coef_.flatten())

An array of all coefficients! We rank these based on greatest Pearson coefficient of correlation, then statistically significant p-value (as long as it’s less than 0.05 then the symbol is statistically significant). We are not using the coefficients from the all-32-countries model because they are not standardized and assume a strong linear correlation with country Y.

For the visualization, focus on unique elements of Country/Region.

countryRankings = [None] * len(uniqueCountry)fig, ax = plt.subplots(5, 6)  fig.suptitle('All Countries/Regions Correlation with ' + sys.argv[2])countriesOriginal = list(data['Country/Region'].unique())countriesOriginal.pop(countriesOriginal.index(sys.argv[2]))

For each unique country, we want to find its importance.

for countryID in range(0, len(X[0])):
     confirmedCases35DaysForCountry = [0] * max(df['Last Update'])
     for i in range(0, len(X)):
          confirmedCases35DaysForCountry[i] = X[i][countryID]     pearson_coef, p_value = stats.pearsonr(confirmedCases35DaysForCountry, Y)     countryRankings[countryID] = [pearson_coef, p_value,
          modelCoefficients[countryID], countryID, 
          countriesOriginal[countryID]]     countryRankings[countryID] = [
          0 if x != x else x for x in countryRankings[countryID]]     ax[math.floor(countryID / 6), countryID % 6].scatter(confirmedCases35DaysForCountry, Y, s=5, alpha=0.5, c='b')     ax[math.floor(countryID / 6), countryID % 6].set(title=countriesOriginal[countryID])     ax[math.floor(countryID / 6), countryID % 6].tick_params(axis='both', labelsize=3, length=0)

So each entry in countryRankings is [Pearson Coefficient, P-Value, Coefficient in all-32 linear model, its index out of all stocks in alphabetical order, and the symbol itself]. The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). We want the lowest p-value, the greatest Pearson coefficient, and the greatest coefficient, so plot the confirmed cases of the country, and the confirmed cases of country Y. Then, set the title and remove ticks: only show outer labels and tick labels:

for a in ax.flat:
     a.label_outer()

Set spacing between subplots,

plt.subplots_adjust(wspace=0.5, hspace=0.5)

render the figure,

plt.show()

remove None values, and sort from largest Pearson correlation coefficient to smallest.

countryRankings = [i for i in countryRankings if i] countryRankings = sorted(countryRankings, key=lambda x: (-x[0]))

If the p-value is greater than 0.05, then there’s a chance that in the context of coronavirus cases, the country isn’t really correlated with country Y. So, we need to move countries with large p-values (low statistical significance) to the end of the ranking.

statisticallySignificantValues = list(
     filter(lambda x: x[1] <= 0.05, countryRankings))

Anything with a p-value larger than 5% means that the country may really not be correlated with country Y at all. That’s the tradition in statistics, so we are using 0.05 as the cutoff point. For these non-significant values, let’s sort them from largest coefficient to smallest coefficient. We want the variables to have the greatest influence on the Y value.

nonSignificantValues = filter(lambda x: x[1] > 0.05, countryRankings)nonSignificantValues = sorted(nonSignificantValues, key=lambda x: (-x[2]))

Re-combine the two sets.

countryRankings = statisticallySignificantValues + nonSignificantValues

Take the first n countries, those which have the highest ranking;

countryRankingsTopN = countryRankings[:int(sys.argv[1])]

columnIndices has the indices of the countries we are going to include.

columnIndices = []

Find which column numbers from dfObj correspond to the top N countries we found,

for x in countryRankingsTopN:     columnIndices.append(list(dfObj.columns).index(x[4]))columnIndices = list(columnIndices)

from the existing dfObj, take only the columns representing the N most important variables,

dfObjTopNRegions = dfObj.iloc[:, columnIndices]

convert the dataframe back into an array and within this array, each subarray represents one row of the original dataframe.

dfObjTopNRegions = dfObjTopNRegions.values.reshape(-1, int(sys.argv[1]))

Y is an array with all country Y confirmed case values.

model = LinearRegression(fit_intercept=False).fit(dfObjTopNRegions, Y)

The output of the model needs to be flattened and converted into the list datatype.

modelCoefficients = list(model.coef_.flatten())

Print this or alternatively write this to .csv depending on how approximate_index.py is invoked (python3 approximate_confirmed.py 3 Mainland\ China 2019_nCoV_2020_0121_20200205.csv > approximation.csv).

res = pd.DataFrame(columns=['Country/Region', 'Weight'])

Now, find the columns at those indices and then get the column names from the .columns property,

res['Country/Region'] = list(dfObj.iloc[:, columnIndices].columns)

round it to 3 places after the decimal,

res['Weight'] = np.around(modelCoefficients, 3)

(this is the only system output we want, because we are writing to .csv),

print(res)

and predict the response!

y_pred = model.intercept_ + np.sum(model.coef_*dfObjTopNRegions, axis=1)fig, ax = plt.subplots()ax.scatter(y_pred, Y, s=5, alpha=0.5, c='b')

String together the title, remove the extra + sign, and add the coefficient of determination (R²) just for fun.

title = sys.argv[2] + '= 'for i in range(len(res['Weight'])):
     title += str(int(res['Weight'][i])) + res['Country/Region'][i] + ' + 'title = title[:len(title) - 3]title = title + ', R^2 = ' + \str(np.around(model.score(dfObjTopNRegions, Y), 3))

We can generate some cool plots!

plt.title('\n'.join(wrap(title, 60)), fontsize=10)plt.xlabel('Predicted Value of ' + sys.argv[2])plt.ylabel('True Value of ' + sys.argv[2])plt.show()

All Countries/Regions, #Confirmed Correlation with Mainland China

All Countries/Regions, #Confirmed Correlation with Malaysia

If you like this article, check out the original dataset which is a starting point for people to gather more statistics, stories, and government responses regarding epidemics.

Approximating Coronavirus Cases in 2019

Written by Dean Gladish