
CORONA - COMPARING NATIONS

by Ditty, April 2020
Keywords: Corona, COVID-19, Mortality Curve, Germany, Italy, The Netherlands and Spain
Recap - follow-up of previous posts.
In my first (introductory) post about the COVID-19 pandemic I presented a framework for how we can gradually increase our knowledge of the COVID-19 phenomenon using some key concepts from statistics and data science. In the second post I zoomed in on how the virus spread in China, using data made available by Johns Hopkins University. In this third post I take a closer look at how the spread developed across countries in Europe.
Introduction - What is going on 
Since the COVID-19 outbreak started in Wuhan, China, in December 2019, a lot has been written about the phenomenon, and the World Health Organization declared it a pandemic on the 11th of March 2020. According to Johns Hopkins University, 184 countries (at the time of writing this post) have been affected by the virus and have registered cases of COVID-19.
Objective & Limitations - Keep it small, gradually increase knowledge
Given the large amount of data and information (news, blogs, social media and official government documents) that citizens are confronted with, I sometimes found it difficult to distinguish facts from opinions. As a result, I found myself analysing the Corona outbreak on a daily basis, trying to fact-check statements that experts or politicians had made the day before on popular late night talk shows in the Netherlands, like Jinek or Op1. All in order to try and make sense of the virus and the outbreak.

From a statistical and data science perspective, pandemics are interesting subjects to study, as they touch on a lot of key statistical concepts that we use in data science, e.g. distribution theory, significance levels and exponential growth. My objective in this series of posts is to simplify what is going on while keeping it understandable for audiences that are less data-literate. This implies that it is preferable to keep research questions small. By answering one question per post, we gradually build knowledge about the virus and its outbreak, and at the same time help readers become more data-literate.
Research Question - what to better understand
I noticed that, as the spread intensified, a lot of attention was given to the situation in Italy, as the media portrayed a grim picture of what was going on in Italian hospitals. Consequently, a lot of people were concerned about the situation in The Netherlands and whether we were heading towards a similar situation as in Italy. At the same time a big increase was also noticeable in Spain, and while I was checking the numbers on a daily basis it struck me that, compared to the size of its population, the numbers in Germany remained relatively low. Thus I wanted to look at the development of the spread in those specific countries.

For this series of posts I will be zooming in on the following research questions / topics: 
From observation, to research question and hypothesis.
In this third post I take a closer look at how the spread developed across countries. At the time of writing this post, the number of registered cases was increasing in The Netherlands and the media were focused on comparing the situation to that of Italy.
What struck me at the time, however, was that the number of infected, hospitalised and intensive care (IC) patients and the number of deaths in Germany seemed to be on the low side. Based on this observation I was interested in investigating how the spread developed across countries and whether there are statistical grounds to support assumptions about differences or similarities between countries.

From a data science perspective (and this is how it might also work in your organisation) we translate observations into research questions, and research questions into hypotheses. An important point in hypothesis development is that, in order to fairly compare observations (in this case The Netherlands, Italy, Germany and Spain), we need the same unit of analysis, measured under the same or similar conditions.

When comparing different countries, a limitation can be found in the number of newly registered cases. New cases have not been a reliable indicator, because COVID-19 tests are scarce. This means that in the real world the number of new cases is likely to be higher than the number of reported new cases.

For this reason we will omit this metric from the research, and focus on a metric that is more reliable across countries, namely the number of registered deaths related to COVID-19. To satisfy the second requirement, similar conditions for the analysis, we need to translate the unit of analysis (the number of registered deaths) into a comparable unit. At the time of this research only two weeks of data was available for The Netherlands. For that reason we will look at the development of the number of COVID-19 related deaths in the first 14 days after the first COVID-19 related hospitalised patient per country. Summarising, the following research question is formulated.

  1. Are there differences in the number of COVID-19 related deaths between Germany, Italy, The Netherlands and Spain?

Managerial Implication 

The reason why I am paying extra attention to the research question formulation phase of the research (and, for that matter, of any data science initiative) is that in practice I see that the accountability of data science initiatives is often questioned in (more) commercial settings. I believe that clearly translating observations into research questions and hypotheses will help these discussions and make transparent the added value data science has in your organisation.



Countries in the news
Just to give you an idea of what the situation was like in Italy, Germany, Spain and The Netherlands during the first month after the first reported COVID-19 case in Italy.
ITALY
GERMANY
SPAIN
THE NETHERLANDS
Tools and concepts to use
Using R (a language and environment for statistical computing and graphics), I will apply data science techniques, concepts and methods like descriptive statistics, distribution theory (the normal distribution), box plots and multivariate analysis. The objective is to statistically support claims about the differences between the number of COVID-19 related deaths in the first 14 days after the first COVID-19 related hospitalised patient per country.


Let's get staRted - Data collection & Manipulation
For this analysis I used the data obtained from Johns Hopkins University's GitHub, a place where they provide daily files with various information, e.g. the number of new cases, the number of deaths related to COVID-19 etc. This data also forms the basis of their well-known dashboard, which a lot of media use as the main source of information in their articles.


Meta Data
Before starting the analysis let's have a closer look at the data that is available. The image below shows that on a daily level, per country and region, we have the (1) cumulative confirmed cases, (2) cumulative deaths and (3) cumulative recovered cases.
Data Transformation
However, for analysing how the spread develops it is useful to add some daily metrics through data transformation, also referred to as feature development. In this case we add a lag function that calculates the metrics on a daily basis, by taking the incremental change compared with the previous day for that specific province and country.

The figure below shows the result after adding the lag function, where the columns New Confirmed, New Recovered and New Deaths have been added to the table.
However, keeping in mind that we would like to compare the countries with each other, we would also like to create a feature that represents the number of days since their first registered infection. At a later stage this will tell us how much time has passed since the first registered case, which we can relate to the daily mortality figures. In other words, we would like to know how the mortality figures develop in the days after the first registered infection.
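The lag step described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the analysis. The records are made-up values, not rows from the Johns Hopkins files, and the function name `add_daily_increase` is hypothetical. One assumption to note: on the first day of a series there is no previous day, so the sketch treats the previous value as zero.

```python
# Hypothetical cumulative records: (country, province, ISO date, cumulative deaths).
# The values are invented for illustration and are not taken from the JHU files.
records = [
    ("A", "North", "2020-03-01", 0),
    ("A", "North", "2020-03-02", 2),
    ("A", "North", "2020-03-03", 5),
    ("A", "South", "2020-03-01", 1),
    ("A", "South", "2020-03-02", 1),
    ("A", "South", "2020-03-03", 4),
]


def add_daily_increase(rows):
    """Return the rows extended with the day-over-day change per (country, province)."""
    previous = {}  # last cumulative value seen for each (country, province)
    result = []
    # ISO dates sort correctly as strings, so this puts every series in time order.
    for country, province, day, cumulative in sorted(rows):
        key = (country, province)
        # Assumption: the first day of a series has no lag value, so we use 0.
        new = cumulative - previous.get(key, 0)
        previous[key] = cumulative
        result.append((country, province, day, cumulative, new))
    return result


daily = add_daily_increase(records)
for row in daily:
    print(row)
```

The important design choice is that the lag is taken within each province and country separately, so the last day of one region never leaks into the first day of the next.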

This is done by first aggregating the data from Province level to Country level, and then adding an indicator for each Country that counts the days from the first date on which there is at least one confirmed case in that country.
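A minimal sketch of these two steps, again in Python rather than R and with invented numbers. The function names `country_totals` and `days_since_first_case` are hypothetical, and counting the day of the first confirmed case as day 1 is an assumption; the original indicator may start at 0.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (country, province, day, cumulative confirmed, cumulative deaths).
rows = [
    ("X", "P1", date(2020, 2, 26), 0, 0),
    ("X", "P2", date(2020, 2, 26), 0, 0),
    ("X", "P1", date(2020, 2, 27), 1, 0),
    ("X", "P2", date(2020, 2, 27), 0, 0),
    ("X", "P1", date(2020, 2, 28), 2, 0),
    ("X", "P2", date(2020, 2, 28), 1, 1),
]


def country_totals(rows):
    """Sum province-level figures into one total per (country, day)."""
    totals = defaultdict(lambda: [0, 0])
    for country, _province, day, confirmed, deaths in rows:
        totals[(country, day)][0] += confirmed
        totals[(country, day)][1] += deaths
    return totals


def days_since_first_case(totals):
    """Attach a day counter that starts at 1 on the first day with a confirmed case."""
    first_case = {}
    for (country, day), (confirmed, _deaths) in totals.items():
        if confirmed >= 1 and (country not in first_case or day < first_case[country]):
            first_case[country] = day
    ranked = []
    for (country, day), (confirmed, deaths) in sorted(totals.items()):
        # Days before the first confirmed case are dropped.
        if country in first_case and day >= first_case[country]:
            rank = (day - first_case[country]).days + 1
            ranked.append((country, day, rank, confirmed, deaths))
    return ranked


ranked = days_since_first_case(country_totals(rows))
for row in ranked:
    print(row)
```

With this counter in place, countries can be lined up on "days since first case" instead of on calendar date, which is what makes the later comparison fair.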

The image below shows the result after adding an indicator that counts the days after the first infection.

As you can see, the variable rank tells us that for the first row (The Netherlands) the 3rd of March 2020 was the first day with a known COVID-19 related mortality, and that this was 9 days after the first registered COVID-19 case.
Computing such features is an essential part of the analysis, and for that reason I wanted to double check this information against other public sources. The images below show that the National Institute for Public Health and the Environment (Ministry of Health, Welfare and Sport) reports the first death on the 6th of March on its website (2020, April).

Analysis 
Now that we have validated and transformed our data, we can plot it relatively easily. The two graphs below illustrate how the COVID-19 related mortalities developed, seen from the day of the first registered case. These are also referred to as mortality curves.
Mortality Curves
In the graph below you see how the mortalities developed (the mortality curve).
What is striking is that there seem to be differences between countries in the number of days between the first registered case and the first mortality. Therefore it might be fruitful to add more detail to the graph above by creating a new graph that illustrates the mortality curves per country. It is noticeable that the curves for Italy and Spain seem to change character from linear growth towards exponential growth at the point where the curve noticeably steepens.
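One simple way to make the difference between linear and exponential growth concrete is sketched below, in Python and with synthetic series rather than the actual mortality data. With linear growth the daily increments stay roughly constant, while with exponential growth the day-over-day ratios stay roughly constant. The function names are hypothetical, and this is a rough diagnostic, not the method used to draw the curves above.

```python
def daily_increments(series):
    """Absolute change from one day to the next."""
    return [b - a for a, b in zip(series, series[1:])]


def growth_ratios(series):
    """Relative change from one day to the next, skipping days that start at zero."""
    return [b / a for a, b in zip(series, series[1:]) if a > 0]


# Synthetic cumulative series, chosen only to show the two growth types.
linear = [2, 4, 6, 8, 10]
exponential = [2, 4, 8, 16, 32]

print("linear increments:     ", daily_increments(linear))
print("linear ratios:         ", growth_ratios(linear))
print("exponential increments:", daily_increments(exponential))
print("exponential ratios:    ", growth_ratios(exponential))
```

On real data neither pattern will be exact, so in practice you would look for the day from which the ratios stop shrinking while the increments keep growing.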
Conclusion
In this post I tried to give more insight into how COVID-19 related mortalities developed in Germany, Italy, The Netherlands and Spain. Dutch media were paying a lot of attention to the situation in Italy, and people were worried about whether the Dutch situation would develop similarly. Spain's situation was included because it was also worrying, in contrast to the developments in Germany.

On the basis of our analysis we found support for the view that the Dutch situation remains worrying, taking into account that the character of the growth in Italy and Spain changed from linear to exponential. For Spain this change in growth type occurred earlier than for Italy. For Italy this was around 28 days after its first COVID-19 related mortality and for Spain this was around 12 days.

The difference between countries in the time between the first registered COVID-19 case and the related mortalities remains remarkable. This might give us hints about how governmental policies differed in their response to the outbreak.
Future Research
In this analysis we merely looked at the mortality curves of different countries in absolute values. For future research it might be interesting to look at the virus and its spread relative to the (national) population, making comparisons between countries more robust. Adding data on how governments reacted (lockdown, semi-lockdown or none) may also contribute to finding effective outbreak responses.
Concluding remarks - not an epidemiologist
Please note that I am not an epidemiologist and am thus wary of relating the outcomes of my endeavours to the current situation. My aim is to create a better understanding of the virus and its outbreak and hopefully help people (and myself) to place the presented information in a more statistical perspective.
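As a rough illustration of why population size matters, the Python sketch below converts an absolute count into a count per million inhabitants. The population figures are approximate, rounded 2020 values that I am assuming here, not numbers from the dataset, and the count of 100 is a made-up value used for every country to isolate the effect of population size. The function name `per_million` is hypothetical.

```python
# Approximate populations in millions (rounded, assumed for illustration only).
POPULATION_MILLIONS = {
    "Germany": 83.0,
    "Italy": 60.0,
    "The Netherlands": 17.4,
    "Spain": 47.0,
}


def per_million(count, country):
    """Scale an absolute count to a count per million inhabitants."""
    return count / POPULATION_MILLIONS[country]


# The same hypothetical absolute count weighs very differently per country.
HYPOTHETICAL_COUNT = 100
for country in POPULATION_MILLIONS:
    print(country, round(per_million(HYPOTHETICAL_COUNT, country), 2))
```

The same absolute number translates into a much higher rate for a small country than for a large one, which is why a per-capita comparison would be a useful follow-up.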

Please also note that this post has not been peer-reviewed. The sole purpose of my endeavours is to present insights into the outbreak and to show how, with relatively limited coding in R, it is possible to dive into the data yourself. Hopefully this gets you enthusiastic about the possibilities data science provides.
Hope you enjoyed reading! 

All the best,

Ditty


About the Author: Ditty Menon

Founder of The Data Artists, The Data Artists Music and Nederland Wordt Duurzaam


Erasmus University Rotterdam alumnus with 12 years of experience in Data Science / Analytics / Digital. Passionate about incorporating data into all aspects of life & (more recently) using data for a sustainable world.


Random facts:

Starts his day with a flat white or caffe latte and the Financial Times podcast.

Broke his glasses when walking into a lamppost while thinking of a coding issue.

Loves Serendipity
