Consecutive days of good weather in the Netherlands

2018-05-14 7 min read data analysis

I have recently moved in North Holland and in the past weeks the weather was particularly fortunate: for many (consecutive) days there was no rain and the temperature have been very high for this area (the maximum temperature was easily above 25° degrees).

Given that I have no experience yet for this weather, I asked around how frequently this happens and I got diverse answers. Then, I have decided to look at some historical time-series of temperature and precipitation to try to satisfy my curiosity.

Firstly, I have downloaded the data of the weather stations part of the ECA (European Climate Assessment) dataset. The dataset is freely available for non-commercial research and education. In the North Holland area there are 4-5 suitable weather stations and I opted for the one from Schiphol, which is the longest one with the data starting in 1951 (temperature, precipitation data is even longer). To give you an idea, this is the list of the stations providing maximum temperature data available in the latest release of the dataset (17.0).

To carry out the analysis I have used the core Tidyverse R packages and lubridate (one non-core tidyverse package) plus zoo.

library(lubridate)
library(tidyverse)
library(zoo)

I have downloaded two large zip files containing all the station data for precipitation (rr) and maximum temperature (tx) and from them I have selected the files for Schiphol. I have loaded them into memory with the following lines:

read_station <- function(filename) {
    read_csv(filename,
    col_types = cols(DATE = col_datetime(format = "%Y%m%d")),
    skip = 20
    )
}
station_files <- c("RR_STAID000593.txt", "TX_STAID000593.txt") # SCHIPHOL (NL)
print(read_station(station_files[1]))

## # A tibble: 55,273 x 5
##    STAID SOUID DATE                   RR  Q_RR
##    <int> <int> <dttm>              <int> <int>
##  1   593   503 1866-12-01 00:00:00 -9999     9
##  2   593   503 1866-12-02 00:00:00 -9999     9
##  3   593   503 1866-12-03 00:00:00 -9999     9
##  4   593   503 1866-12-04 00:00:00 -9999     9
##  5   593   503 1866-12-05 00:00:00 -9999     9
##  6   593   503 1866-12-06 00:00:00 -9999     9
##  7   593   503 1866-12-07 00:00:00 -9999     9
##  8   593   503 1866-12-08 00:00:00 -9999     9
##  9   593   503 1866-12-09 00:00:00 -9999     9
## 10   593   503 1866-12-10 00:00:00 -9999     9
## # ... with 55,263 more rows

print(read_station(station_files[2]))

## # A tibble: 24,562 x 5
##    STAID SOUID DATE                   TX  Q_TX
##    <int> <int> <dttm>              <int> <int>
##  1   593  2114 1951-01-01 00:00:00 -9999     9
##  2   593  2114 1951-01-02 00:00:00    18     0
##  3   593  2114 1951-01-03 00:00:00    16     0
##  4   593  2114 1951-01-04 00:00:00    19     0
##  5   593  2114 1951-01-05 00:00:00    84     0
##  6   593  2114 1951-01-06 00:00:00    86     0
##  7   593  2114 1951-01-07 00:00:00    86     0
##  8   593  2114 1951-01-08 00:00:00    70     0
##  9   593  2114 1951-01-09 00:00:00    63     0
## 10   593  2114 1951-01-10 00:00:00    61     0
## # ... with 24,552 more rows

In the resulting tibbles we have the daily points of precipitation (RR) and its quality code (Q_RR), and maximum temperature and its quality code. The first two columns are the station and the source identifiers and the third, most important, is the date for the specific sample. Now we need to join the two data structures, select the columns we need and process a bit the outuput because both the temperature and precipitation are scaled of a factor 10 as specified in the metadata.

merged <- inner_join(read_station(station_files[1]),
    read_station(station_files[2]),
    by = c("DATE", "STAID")
) %>%
    select(STAID, date = DATE, prec = RR, tmax = TX) %>%
    filter(tmax > -9999) %>% # -9999 is the MISSING VALUE CODE 
    mutate(
    tmax = tmax / 10,
    prec = prec / 10
    ) 
    
print(merged)

## # A tibble: 24,561 x 4
##    STAID date                 prec  tmax
##    <int> <dttm>              <dbl> <dbl>
##  1   593 1951-01-02 00:00:00   0     1.8
##  2   593 1951-01-03 00:00:00   0     1.6
##  3   593 1951-01-04 00:00:00   3.4   1.9
##  4   593 1951-01-05 00:00:00   7.4   8.4
##  5   593 1951-01-06 00:00:00   2.1   8.6
##  6   593 1951-01-07 00:00:00   0.2   8.6
##  7   593 1951-01-08 00:00:00   6     7  
##  8   593 1951-01-09 00:00:00   2.1   6.3
##  9   593 1951-01-10 00:00:00  10.7   6.1
## 10   593 1951-01-11 00:00:00   9.9   7.4
## # ... with 24,551 more rows

Ok, now we have a tidy and clean data structure with all the data we need for my analysis. Now, let’s turn it into information. The main point is to define — in my view — the ``good'' weather. My aim is to understand how often we have at least three days without precipitation and with the maximum temperature greater than 22°C. Thus, I will create three boolean variables: 1) if a day is without precipitation; 2) if the maximum temperature is greater than 22°C; 3) the logical AND of the first two (a day without precipitation and with the maximum temperature > 22°C).

And how to capture the consecutive periods? I have decides to apply a rolling function using the prod function (the product of all the values in a vector) with a window of three days. In this way, the output of the function would be equal to one only if in all the previous days we had a TRUE condition (don’t forget that in R the logical values FALSE and TRUE are converted to numeric as 0 and 1). But, what if we are lucky enough to have a period longer than 3 days? For my analysis I am looking for all the periods of at least three days. The solution I have implemented was using the diff function and counting only the ones: when we go from zero (i.e. condition not satisfied in the previous three days) to one (i.e. condition satisfied). Here the code:

processed_data <- merged %>%
    mutate(
    no_prec = prec == 0,
    high_tmax = tmax > 22,
    good_weather = no_prec & high_tmax
    ) %>%
    mutate(
    roll_good_weather = zoo::rollapply(good_weather, width = 3, FUN = prod, fill = NA, align = "right"),
    good_weather_period = c(NA, diff(roll_good_weather)) == 1
    )

print(processed_data[23988:23998,]) # a part of the dataset that could help to understand the algorithm

## # A tibble: 11 x 9
##    STAID date                 prec  tmax no_prec high_tmax good_weather
##    <int> <dttm>              <dbl> <dbl> <lgl>   <lgl>     <lgl>       
##  1   593 2016-09-04 00:00:00   3.5  20.8 FALSE   FALSE     FALSE       
##  2   593 2016-09-05 00:00:00   0.8  21.8 FALSE   FALSE     FALSE       
##  3   593 2016-09-06 00:00:00   0    22.8 TRUE    TRUE      TRUE        
##  4   593 2016-09-07 00:00:00   0    26.1 TRUE    TRUE      TRUE        
##  5   593 2016-09-08 00:00:00   0    27   TRUE    TRUE      TRUE        
##  6   593 2016-09-09 00:00:00   0    22.9 TRUE    TRUE      TRUE        
##  7   593 2016-09-10 00:00:00   0    25.5 TRUE    TRUE      TRUE        
##  8   593 2016-09-11 00:00:00   0.2  21.7 FALSE   FALSE     FALSE       
##  9   593 2016-09-12 00:00:00   0    27.6 TRUE    TRUE      TRUE        
## 10   593 2016-09-13 00:00:00   0    30.5 TRUE    TRUE      TRUE        
## 11   593 2016-09-14 00:00:00   0    31   TRUE    TRUE      TRUE        
## # ... with 2 more variables: roll_good_weather <dbl>,
## #   good_weather_period <lgl>

Ok, now the good_weather_period column is TRUE when we had a period of “good” weather (according to my definition). Let’s see, usingtable\, how many periods of good weather we had since 1951 and in which month of the year:

print(
    table(month = month(processed_data$date), 
        good_weather = processed_data$good_weather_period)
)

##      good_weather
## month FALSE TRUE
##    1   2104    0
##    2   1921    0
##    3   2108    0
##    4   2004    6
##    5   2053   24
##    6   1961   49
##    7   2007   70
##    8   2021   56
##    9   1988   22
##    10  2077    0
##    11  2010    0
##    12  2077    0

Ok, in the last 67 years we had in April only six events of good weather, and in May instead 24. We have an answer: it is definitely not common to have three consecutive days of nice weather in the North Holland. But this is should not be a surprise according to the Internet..

Perhaps we can add another question, a bit more scientifical: do we have a trend? Is the frequency of those events changing? Let’s get a first answer by using again a rolling function, this time directly on the good_weather_period variables looking for the number of events in a window of 20 years.

plot(rollapply(processed_data$good_weather_period, 365 * 20, align = "right", FUN = sum, na.rm = TRUE, fill = NA) ~ processed_data$date, 
        type = "l", lwd = 2, ylab = "number of events in 20 years", xlab = 'date', 
        main = 'number of events of good-weather events in 20 years')

Consecutive days of good weather

That is definitely a positive trend, probably due to the well-known positive trend of temperature over Europe and the Netherland. This is an example of how to use (free) observed data and some relatively simple code to try to understand the “weather” of a specific area. To this end, I would also suggest to keep an eye on the Copernicus Climate Change products, that are free for everyone and up-to-date.

Consecutive days of good weather in the Netherlands

Lead Data Science Services

Related