Approximating 2019-nCoV Coronavirus Cases
JHU is now storing its data in a GitHub repo instead of serving it through a web API, so we work from a static snapshot covering January 21 to February 5, 2020. As for subject context, 2019-nCoV is a contagious coronavirus that originated in Wuhan, China. This new strain of virus has struck fear in many countries as cities are quarantined and hospitals are overcrowded. To help us understand how 2019-nCoV spreads around the world, let's look at how confirmed cases in different countries correlate across the days in the dataset. Special thanks to Johns Hopkins University for open-sourcing their dataset.
Of our nine .csv files, the one we use in this tutorial is 2019_nCoV_20200121_20200205.csv, from kaggle.com/brendaso/2019-coronavirus-dataset-…. We are testing which countries/regions correlate with Confirmed by Last Update, and the setup is Python 3 in zsh. We obtain weights in approximation.csv by invoking the following command:

python3 approximate_confirmed.py 3 Singapore 2019_nCoV_20200121_20200205.csv > approximation.csv

The resulting weights look like this:
  Country/Region  Weight
0    South Korea   1.143
1      Hong Kong   0.269
2    Philippines  -0.950
We begin writing approximate_confirmed.py. The arguments supplied to the command are the number of countries/regions to include in our model, the country whose confirmed cases we are approximating, and the dataset. Because the command above redirects the weights into approximation.csv, we can also invoke the script without the trailing > approximation.csv; the output then goes to the zsh terminal instead of the .csv file:
python3 approximate_confirmed.py 3 Mainland\ China 2019_nCoV_20200121_20200205.csv
python3 approximate_confirmed.py 6 Malaysia 2019_nCoV_20200121_20200205.csv
python3 approximate_confirmed.py 6 Finland 2019_nCoV_20200121_20200205.csv
python3 approximate_confirmed.py 6 Spain 2019_nCoV_20200121_20200205.csv
On our dataset it's important to recall the Last Update column, for which the script uses 1/1/2020 as day zero, and the Country/Region column, which contains 32 unique countries/regions across 35 days.
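As a quick sanity check on those claims, here is a minimal sketch, run separately from the tutorial script and assuming the .csv sits in the working directory:

import pandas as pd

# Count distinct countries/regions in the snapshot; we expect 32.
data = pd.read_csv('2019_nCoV_20200121_20200205.csv')
print(data['Country/Region'].nunique())
print(data[['Country/Region', 'Last Update', 'Confirmed']].head())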
Let's find different ways to estimate the approximate confirmed coronavirus cases in Country Y with few variables! Thank you and enjoy your stay. This requires importing numpy for array manipulation, math for math functions, matplotlib.pyplot to visualize, stats from scipy to calculate the Pearson correlation coefficient and p-value of the model, pandas to read .csv files and mutate data frames, LinearRegression from sklearn.linear_model to create the multilinear regression model, and sys to access command line args. The datetime library allows us to convert dates (2020-01-01, …, 2020-02-05) into integers (0, …, 35), and the textwrap library allows us to keep large equations on the graph.
import numpy as np
import math
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
from sklearn.linear_model import LinearRegression
import sys
from datetime import date
from textwrap import wrap
Once we read in the data and turn it into a pandas dataframe, we create a separate column name for each Country/Region and define a function that takes in the current day string, such as 1/22/2020 12:00, and returns its distance in days from 1/1/2020, the starting point.
data = pd.read_csv(sys.argv[3])
df = pd.DataFrame(data, columns=['Country/Region', 'Last Update', 'Confirmed'])
newColumns = np.append(['Last Update'], data['Country/Region'].unique())

def convertDay(day):
    arr2 = day.split(" ")[0].split("/")
    arr1 = "1/1/2020".split("/")
    if (arr2[2] == '20'):
        arr2[2] = '2020'
    arr1 = list(map(lambda x: int(x), arr1))
    arr2 = list(map(lambda x: int(x), arr2))
    first, second, third = arr1
    f_date = date(third, first, second)
    first, second, third = arr2
    l_date = date(third, first, second)
    delta = l_date - f_date
    return (delta.days)
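As a quick illustration of convertDay, assuming timestamps in the file look like 1/22/2020 12:00 (or 1/22/20 12:00 for the rows the two-digit-year branch handles), it measures days elapsed since 1/1/2020:

# Both year formats should agree after the '20' -> '2020' fix-up.
print(convertDay("1/22/2020 12:00"))  # 21
print(convertDay("1/22/20 12:00"))    # 21
print(convertDay("2/5/2020 23:00"))   # 35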
Based on the original Last Update column, make a new list of integer representations of each day, and write it back into the Last Update column,
newDates = list(map(lambda x: convertDay(x), data['Last Update']))
df['Last Update'] = newDates
create dfObj, an empty dataframe whose columns are Last Update followed by each unique country/region: Mainland China, Singapore, Thailand, …, Colombia,
dfObj = pd.DataFrame(columns=newColumns)
and start populating dfObj. For all days 0, 1, …, 35, find all rows matching the current day, sum the confirmed cases per country for that day, and if we had data for that day, add it to dfObj at that day.
for day in range(0, max(df['Last Update'])):
    uniqueCountry = df['Country/Region'].unique()
    aa = df[df["Last Update"] == day]
    aa = aa.groupby("Country/Region").sum()
    values = []
    for country in uniqueCountry:
        if (country in aa.iloc[:, 0].keys().tolist()):
            values.append(aa.iloc[:, 1][country])
        else:
            values.append(0)
    if (len(values)):
        dfObj.loc[day] = [day, *values]
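Before moving on, a quick look at the result never hurts; this hypothetical check is not part of the script's output:

# One row per day; one column per country/region, plus 'Last Update'.
print(dfObj.shape)
print(dfObj.head())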
At this stage we have gone from data to dfObj: the translation is successful! From the command line arg Y supplied by the user,
indexY = list(dfObj.columns).index(sys.argv[2])
we get the Y column's index. Next, let's get all the feature columns. Each country/region is one column showing Confirmed: the accumulated number of confirmed 2019-nCoV cases from the start date to the end date.

X = dfObj.iloc[:, 1:]  # all country columns (the original 1:len(uniqueCountry) slice dropped the last one)
X = X.drop([sys.argv[2]], axis=1)  # remove country Y itself from the features
X = X.values.reshape(-1, len(X.columns))

(Numpy should calculate the number of rows and assume there are len(X.columns) columns, one for each remaining Country/Region.) Then get the Country Y column, which is Y:
Y = dfObj.iloc[:, indexY].values.reshape(-1, 1)
(Numpy should calculate the number of rows, with one column for Y.) First, a preliminary multilinear model fit on all of the remaining countries at once:
model = LinearRegression(fit_intercept=False).fit(X, Y)
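Since fit_intercept=False, this is ordinary least squares through the origin; as a sketch of what sklearn is doing under the hood, the same weights should fall out of numpy directly (an equivalence check, not part of the tutorial script):

# Solve min ||X w - Y||^2 with no bias term; w should match model.coef_
# (transposed) up to floating-point error.
w, residuals, rank, sv = np.linalg.lstsq(X.astype(float), Y, rcond=None)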
From Y, remove the list encasing each number and convert it to a flat list, because we're about to calculate the Pearson correlation coefficient; also collect the model coefficients as a list:
Y = Y.flatten()
Y = list(Y)
modelCoefficients = list(model.coef_.flatten())
An array of all coefficients! We rank countries by greatest Pearson correlation coefficient, subject to a statistically significant p-value (below 0.05 counts as significant). We are not ranking by the coefficients from the all-countries model because they are not standardized and assume a strong linear relationship with country Y.
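To make the ranking criterion concrete, here is a toy call to scipy's pearsonr on made-up numbers (illustration only):

# A perfectly linear relationship: r = 1.0 with a p-value near zero,
# i.e. strongly correlated and statistically significant.
r, p = stats.pearsonr([1, 2, 3, 4], [10, 20, 30, 40])
print(r, p)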
For the visualization, focus on the unique elements of Country/Region.
countryRankings = [None] * len(uniqueCountry)
fig, ax = plt.subplots(5, 6)
fig.suptitle('All Countries/Regions Correlation with ' + sys.argv[2])
countriesOriginal = list(data['Country/Region'].unique())
countriesOriginal.pop(countriesOriginal.index(sys.argv[2]))
For each unique country, we want to find its importance.
for countryID in range(0, len(X[0])):
    confirmedCases35DaysForCountry = [0] * max(df['Last Update'])
    for i in range(0, len(X)):
        confirmedCases35DaysForCountry[i] = X[i][countryID]
    pearson_coef, p_value = stats.pearsonr(confirmedCases35DaysForCountry, Y)
    countryRankings[countryID] = [pearson_coef, p_value,
                                  modelCoefficients[countryID], countryID,
                                  countriesOriginal[countryID]]
    countryRankings[countryID] = [
        0 if x != x else x for x in countryRankings[countryID]]
    ax[math.floor(countryID / 6), countryID % 6].scatter(
        confirmedCases35DaysForCountry, Y, s=5, alpha=0.5, c='b')
    ax[math.floor(countryID / 6), countryID % 6].set(
        title=countriesOriginal[countryID])
    ax[math.floor(countryID / 6), countryID % 6].tick_params(
        axis='both', labelsize=3, length=0)
So each entry in countryRankings is [Pearson coefficient, p-value, coefficient in the all-countries linear model, its index among all countries/regions, and the country/region name itself]. The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). We want the lowest p-value, the greatest Pearson coefficient, and the greatest model coefficient. The loop also plots each country's confirmed cases against country Y's confirmed cases, sets the subplot title, and shrinks the ticks. Now show only outer labels and tick labels:
for a in ax.flat:
    a.label_outer()
Set spacing between subplots,
plt.subplots_adjust(wspace=0.5, hspace=0.5)
render the figure,
plt.show()
remove None values, and sort from largest Pearson correlation coefficient to smallest.
countryRankings = [i for i in countryRankings if i]
countryRankings = sorted(countryRankings, key=lambda x: (-x[0]))
If the p-value is greater than 0.05, then there’s a chance that in the context of coronavirus cases, the country isn’t really correlated with country Y. So, we need to move countries with large p-values (low statistical significance) to the end of the ranking.
statisticallySignificantValues = list(
    filter(lambda x: x[1] <= 0.05, countryRankings))
Anything with a p-value larger than 5% means the country may really not be correlated with country Y at all; 0.05 is the traditional cutoff point in statistics, so that is what we use. For these non-significant values, sort from largest model coefficient to smallest, because we want the variables with the greatest influence on the Y value.
nonSignificantValues = filter(lambda x: x[1] > 0.05, countryRankings)
nonSignificantValues = sorted(nonSignificantValues, key=lambda x: (-x[2]))
Re-combine the two sets.
countryRankings = statisticallySignificantValues + nonSignificantValues
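As an aside, the same two-tier ordering could be produced in a single sorted call with a composite key; this is an equivalent sketch, not what the tutorial script does:

# Significant entries (p <= 0.05) come first, sorted by descending Pearson r;
# non-significant entries follow, sorted by descending model coefficient.
countryRankings = sorted(
    countryRankings,
    key=lambda x: (x[1] > 0.05, -x[0] if x[1] <= 0.05 else -x[2]))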
Take the first N countries, those with the highest ranking;
countryRankingsTopN = countryRankings[:int(sys.argv[1])]
columnIndices has the indices of the countries we are going to include.
columnIndices = []
Find which column numbers from dfObj correspond to the top N countries we found,
for x in countryRankingsTopN:
    columnIndices.append(list(dfObj.columns).index(x[4]))
then from the existing dfObj, take only the columns representing the N most important variables,
dfObjTopNRegions = dfObj.iloc[:, columnIndices]
and convert the dataframe back into an array; within this array, each subarray represents one row of the original dataframe.
dfObjTopNRegions = dfObjTopNRegions.values.reshape(-1, int(sys.argv[1]))
Y is an array with all country Y confirmed case values.
model = LinearRegression(fit_intercept=False).fit(dfObjTopNRegions, Y)
The output of the model needs to be flattened and converted into the list datatype.
modelCoefficients = list(model.coef_.flatten())
Print this, or alternatively write it to .csv, depending on how approximate_confirmed.py is invoked (python3 approximate_confirmed.py 3 Mainland\ China 2019_nCoV_20200121_20200205.csv > approximation.csv).
res = pd.DataFrame(columns=['Country/Region', 'Weight'])
Now, find the columns at those indices and then get the column names from the .columns property,
res['Country/Region'] = list(dfObj.iloc[:, columnIndices].columns)
round it to 3 places after the decimal,
res['Weight'] = np.around(modelCoefficients, 3)
(this is the only system output we want, because we are writing to .csv),
print(res)
and predict the response!
y_pred = model.intercept_ + np.sum(model.coef_ * dfObjTopNRegions, axis=1)
fig, ax = plt.subplots()
ax.scatter(y_pred, Y, s=5, alpha=0.5, c='b')
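For what it's worth, sklearn's predict applies the same coefficients internally, so the manual expansion above is equivalent to the following (a hypothetical cross-check, using y_pred_check so we don't clobber y_pred):

# Should match y_pred elementwise; intercept_ is 0 because fit_intercept=False.
y_pred_check = model.predict(dfObjTopNRegions)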
String together the title, remove the extra + sign, and add the coefficient of determination (R²) just for fun.
title = sys.argv[2] + ' = '
for i in range(len(res['Weight'])):
    # use the rounded weight itself, not int(), so values like 1.143 survive
    title += str(res['Weight'][i]) + res['Country/Region'][i] + ' + '
title = title[:len(title) - 3]
title = title + ', R^2 = ' + \
    str(np.around(model.score(dfObjTopNRegions, Y), 3))
We can generate some cool plots!
plt.title('\n'.join(wrap(title, 60)), fontsize=10)
plt.xlabel('Predicted Value of ' + sys.argv[2])
plt.ylabel('True Value of ' + sys.argv[2])
plt.show()
If you like this article, check out the original dataset; it is a starting point for gathering more statistics, stories, and government responses regarding epidemics.