Creating an hex map of France electricity consumption

The French Ministry for the Ecological and Inclusive Transition (for which I’m currently working) is ongoing a process of opening data related to energy consumption. Each year, we publish data for every neighborhood in France (at the iris statistical level, even adresses in some cases) and to the nature of the final consumer (a household, an industry, a shop…). These data are available here (website in French – direct link to 2018 electricity consumption data).

Making a map to have an overlook at the situation isn’t easy, because the administrative boundaries of France are very diverse, and a direct mapping will reflect this situation. In order to overcome this issue, a solution is to use a different way to represent the country. In this post, we’re going to talk about hex map (as inspired by this website).

What’s a hex map? First, we need to talk about chloropleth maps; despite their unusual names, they’re perfectly common maps where each area is shaded or colored differently according to the values taking by a variable. They’re quite similar to maps such as this one showing which political party won an election on each voting district, but for continuous variable, and required a scale. To determine the color used for each area, we need to compute the average (or any statistical summary) of the values of every individual belonging in this area.

Hex map are entirely composed of hexagons (something perfectly suited for France!); they’re chloropleth maps, so each hexagon is colored depending on the values taking by a variable. The main difference is that the hexagon aren’t usual geographical areas so we need a way to attribute each value in our data set to one of the hexagons before calculating averages, and therefore colors.

In order to do that, we’re going to load some usual R-packages:

library(tidyverse)  
library(viridis)   
library(ggplot2)   
library(stringr)

Our objective will be to make an hex map of residential consumption of electricity in 2018. First, we need to import the data at iris level (doing this, we miss some of the electricity consumption only associated to larger geographical area, but it’s marginal); as the data is coded as strings with some missing values, we start by cleaning the data.

elec <- read.csv2("donnees_elec_iris_2018.csv",stringsAsFactors = F)
elec <- elec[elec$CONSO != "s" & elec$PDL != "s",] #"s" are missing values

# CONSO (electricity consumption) is coded with a french comma separator ; we need to substitute that and convert to numeric CONSO and PDL (number of  energy delivery points) 
elec$CONSO <- str_replace(elec$CONSO,",",".") 
elec$CONSO <- as.numeric(elec$CONSO)
elec$PDL <- as.numeric(elec$PDL)

# Select only residential sector
res <- elec[elec$CODE_GRAND_SECTEUR == "R",]

Each data point is then associated to the coordinates of the center of the neighborhood, using a iris shapefile (available here). The res dataset obtained looks like this:

  CODE_IRIS      CONSO  PDL     x    y
  100020000   843.8162  124 48.25 4.69
  100030000 10374.7166 1987 48.24 3.72
  100040000   679.8085   88 48.59 4.12
  100050000   769.8268  145 48.29 4.49
  100060000  7318.2055 1534 48.53 4.14
  100070000   429.9674   61 48.16 4.73
  100080000   343.1404   62 48.25 4.60
  100090000   262.1539   55 48.05 4.28
  100100000   140.5366   28 48.54 4.61
  100110000   700.1244  113 48.27 4.74

We’re using a ggplot2 environment to create our hex map. The aesthetics used are aes(x,y), the spatial coordinates. We begin by these line of code in order to use a simpler theme (as we’re creating a map, we don’t need any axis) and to specify the positions (longitude and latitude) we want to map.

c <- ggplot(res, aes(x, y))
c + 
xlim(-9, 12) +
ylim(40, 52) +   
theme_void() 

The main argument of the stat_summary_hex function we’re going to use is bins. This parameter allows us to choose how many hexagons will be displayed on the map; the more hexagon, the more small local variations are shown. For the other parameters, we specify aes(z = CONSO) as we want to color the hexagons in relation to the CONSO (Consumption in french) of the data points contained in this area. The fun = “sum” parameter means that the choice of the color of the hexagon depends on the total consumption inside. Lastly, we add a colour=”grey” parameter, which defines the color of every hexagon’s borders, because using white borders on my computer leads to some graphical artefacts (see here).

c <- ggplot(res, aes(x, y))
c +  stat_summary_hex(bins=40,aes(z = CONSO),fun = "sum", colour="grey") +
xlim(-9, 12) +
ylim(40, 52) +   
theme_void() 

We then define a scale (both numeric and in terms of colors) using the scale_fill_viridis function. In order to better see large differences, we use the trans=”log” option which means that transitions between colors are logarithmic and not arithmetic progressions. The breaks points used are 1 GWh, 10 GWh, 100 GWh and 1 TWh.

 c <- ggplot(res, aes(x, y))
c +  stat_summary_hex(bins=40,aes(z = CONSO),fun = "sum", colour="grey") +
xlim(-9, 12) +
ylim(40, 52) +   
theme_void() +
scale_fill_viridis(
 option="A",
 trans = "log", 
 breaks = c(0,1000,10000,100000,1000000,99999999999),
 name="Electricity Consumption", 
 guide = guide_legend(keyheight = unit(2.5, units = "mm"),      keywidth=unit(4, units = "mm"), 
label.position = "bottom", 
title.position = 'top', nrow=1) )  

Using the following code with bins = 80 results in this another map:

c <- ggplot(res, aes(x, y))
c +  stat_summary_hex(bins=80,aes(z = CONSO),fun = "sum", colour="grey") +
xlim(-9, 12) +
ylim(40, 52) +   
theme_void() +
scale_fill_viridis(
 option="A",
 trans = "log", 
 breaks = c(0,1000,10000,100000,1000000,99999999999),
 name="Electricity Consumption", 
 guide = guide_legend(keyheight = unit(2.5, units = "mm"),      keywidth=unit(4, units = "mm"), 
label.position = "bottom", 
title.position = 'top', nrow=1) )  

These map highlight the position of urban centers with high population density. A more useful information is the total electricity consumption per household (or, more accurately in this case, per energy delivery point, whose quantity per iris is PDL in the dataset), defined this way:

res$ElectricityPerHousehold <- res$CONSO/res$PDL

The code used to generate the map is slighty similar; the stat_summary_hex parameter are modified to aes(z = ElectricityPerHousehold) and fun=”mean”, and, scale_fill_viridis parameters are switched to option=”E” (leading to a new color scheme), new breakpoints of 2, 4, 6 and 8 GWh per household, and a new legend name.

 c <- ggplot(res, aes(x, y))
c +  stat_summary_hex(bins=80,aes(z = ElectricityPerHousehold),fun = "mean", colour="grey") +
xlim(-9, 12) +
ylim(40, 52) +   
theme_void() +
scale_fill_viridis(
 option="E",
 trans = "log", 
 breaks = c(0,2,4,6,8,10),
 name="Electricity Consumption per Household", 
 guide = guide_legend(keyheight = unit(2.5, units = "mm"),      keywidth=unit(4, units = "mm"), 
label.position = "bottom", 
title.position = 'top', nrow=1) )  

This map better hightlights places with large electricity consumptions; Paris and its suburbs are associated to lower consumptions per household, which contrasts with an unusually high total consumption, due to high population density. An high consumption per household is usually associated to electric heating and/or large houses, and parisian dwellings are mostly smaller flats.

Featured image: Electricity Converter Power, by ArtisticOperations

[19] Données du chômage avec R et SDMX-ML

Aujourd’hui un petit post un peu plus “pratique”. On va réaliser le graphique du taux de chômage en France depuis 1975 en utilisant R. Les données sont disponibles sur le site de l’INSEE. En suivant ce lien on va pouvoir les télécharger au format csv. Mais il est beaucoup plus sympathique d’utiliser une méthode un peu plus automatique pour récupérer ces données. Ainsi, dès que l’INSEE les mettra à jour (le trimestre prochain par exemple), il suffira de relancer le script R et le graphe se mettra automatiquement à jour.

Pour ce faire, on va utiliser la compatibilité du nouveau site de l’Insee avec la norme SDMX. Le package R rsdmx va nous permettre de récupérer facilement les données en utilisant cette fonctionnalité :

install.packages("rsdmx")
library(rsdmx)

Sur la page correspondant aux données sur le site de l’Insee, on récupère l’identifiant associé : 001688526. Cela permet de construire l’adresse à laquelle on va pouvoir demander à rsdmx de récupérer les données : http://www.bdm.insee.fr/series/sdmx/data/SERIES_BDM/001688526. On peut également ajouter des paramètres supplémentaires en les ajoutant à la suite de l’url (en les séparant par un “?”). On utilise ensuite cette adresse avec la fonction readSDMX:

url_chomage_insee <- "http://www.bdm.insee.fr/series/sdmx/data/SERIES_BDM/001688526?periodeDebut=1&anneeDebut=1975&periodeFin=3&anneeFin=2016&recherche=criteres&codeGroupe=1533"
donnees_chomage_sdmx <- readSDMX(url_chomage_insee)
donnees_chomage <- as.data.frame(donnees_chomage_sdmx)

On peut regarder la structure des données que l’on a à ce stade :

> str(donnees_chomage)
'data.frame':	167 obs. of  12 variables:
 $ IDBANK          : Factor w/ 1 level "001688526": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQ            : Factor w/ 1 level "T": 1 1 1 1 1 1 1 1 1 1 ...
 $ TITLE           : Factor w/ 1 level "Taux de chômage au sens du BIT - Ensemble - France métropolitaine - Données CVS": 1 1 1 1 1 1 1 1 1 1 ...
 $ LAST_UPDATE     : Factor w/ 1 level "2016-11-17": 1 1 1 1 1 1 1 1 1 1 ...
 $ UNIT_MEASURE    : Factor w/ 1 level "PCT": 1 1 1 1 1 1 1 1 1 1 ...
 $ UNIT_MULT       : Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
 $ REF_AREA        : Factor w/ 1 level "FM": 1 1 1 1 1 1 1 1 1 1 ...
 $ DECIMALS        : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
 $ TIME_PER_COLLECT: Factor w/ 1 level "PERIODE": 1 1 1 1 1 1 1 1 1 1 ...
 $ TIME_PERIOD     : chr  "2016-Q3" "2016-Q2" "2016-Q1" "2015-Q4" ...
 $ OBS_VALUE       : chr  "9.7" "9.6" "9.9" "9.9" ...
 $ OBS_STATUS      : chr  "P" "A" "A" "A" ...

Les deux colonnes qui vont nous intéresser sont “TIME_PERIOD” (date de la mesure) et “OBS_VALUE” (taux de chômage estimé). Pour le moment, ces données sont au format “caractère”. On va les transformer respectivement en “date” et en “numérique”. Notez que l’intégration du trimestre nécessite de passer par le package “zoo” (pour la fonction as.yearqtr) :

library(zoo)

donnees_chomage$OBS_VALUE <- as.numeric(donnees_chomage$OBS_VALUE)

donnees_chomage$TIME_PERIOD <- as.yearqtr(donnees_chomage$TIME_PERIOD, format = "%Y-Q%q")

Il ne reste plus qu’à utiliser l’excellent package ggplot2 pour créer le graphe. On en profite pour ajouter des lignes verticales en 1993 et 2008 (dates des deux dernières crises économiques majeures) :

library(ggplot2)
 
plot_chomage <- ggplot(donnees_chomage, aes(x=TIME_PERIOD, y=OBS_VALUE)) +
 geom_line(colour="red") +
 xlab("Annee") +
 ylab("Taux de chomage") +
 geom_vline(xintercept = c(1993,2008), color="grey", linetype="dashed") +
 annotate("text", c(1993,2008), c(0.15), label=c("1993","2008"), 
 color="grey")

print(plot_chomage)

Taux de chômage en France entre 1975 et 2016
Taux de chômage en France entre 1975 et 2016

Rendez-vous demain pour un article où on s’amusera à faire ressembler ce graphe à ceux du comic xkcd.