The Year of Fake Statistics? How Do Analysts Study the Brexit Vote?
In the world of “post-truth” politics, the established media plays an important role as a source of verified information and credible analyses. With the increased public availability of data and computing tools, it is commonplace for journalists to use custom figures and charts to convey this information to their readers (leading to a new field of data journalism). These articles are easy to find: The New York Times offers a collection of “Raw Data” columns, The Guardian has a data blog, and the Washington Post collects its data visualisation articles in addition to having its own GitHub page.
After the UK’s EU Referendum last June, data journalists scrambled to make sense of the results. Common explanations of voting patterns include: age, level of education, and income (e.g. this article, or this article). So how did the media perform its analyses? Are these the best explanations of voting patterns? In this blog post (which is part worksheet), I will show you how you can use the available data to re-create and assess three of the main explanations presented in the media.
It is important to recognize the limitations and possibilities associated with the data we rely on. The term available data is key here—demographic data is widely available and easy to measure, unlike many other potential factors that influenced how people voted (e.g. a desire or lack thereof to travel or study in the EU, family connections in the EU, a sense of European or British identity, etc.). This means that we only have the data to quantitatively evaluate certain explanations. The good news is that more fine-grained and diverse data is becoming increasingly available, so we can look forward to better analyses in the future.
At this stage, ask yourself: If you had voted in the EU referendum, what factors would have influenced your decision, or if you did vote, what factors influenced your decision? Are these factors possible to measure? Why or why not? Keep your answers in mind while doing/looking through the worksheet below.
To do this we will use the open-source programming language R. The benefit of R is that it is versatile (e.g. in addition to creating the figures below in R, I wrote this blog post using RStudio’s Notebook feature). To get up to speed with the basics of R, try these excellent tutorials: tutorial 1 or tutorial 2. I recommend RStudio as your R interface.
To do this activity at home, you will also need two libraries/packages: ggplot2 (for making plots) and plotly (for making interactive plots). For this post, I used:
- version 2.2.1 of ggplot2,
- version 4.5.6 of plotly.
As all of the commands and output are provided, you can also just follow along with the results.
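If you do plan to run the commands yourself, it can help to confirm that both packages are available first. The following is a minimal sketch using only base R; the variable names `needed` and `installed` are my own, and the install line is left commented out so that nothing is installed unless you choose to.

```r
# Packages this worksheet relies on
needed <- c("ggplot2", "plotly")

# TRUE/FALSE for each package: can it be loaded on this machine?
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Uncomment to install anything that is missing
# install.packages(needed[!installed])

# Report the version of each package that is present
for (pkg in needed[installed]) {
  cat(pkg, as.character(packageVersion(pkg)), "\n")
}
```

If your versions are newer than the ones listed above, the plots may look slightly different from those in this post.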
1 The Data
First let’s get to know our data. The most common data presented in the media is the EU referendum voting data, which gives an account of the percentage of voters in areas and regions across the UK that voted for or against Brexit. This data has often been paired with UK Census data, which can be broken down into the same areas and regions to provide some contextual information about the characteristics of voters. It is important to note that the last UK Census was in 2011, so the data is not a perfect match with the 2016 referendum results. To facilitate the analysis, I have compiled the voting and census data into one file that you can load at home.
The EU referendum voting data comes directly from the Electoral Commission. All remaining variables are from the 2011 UK Census, except for the ‘earnings’ variables which are from the 2016 Annual Survey of Hours and Earnings.
2 Loading Data into RStudio (or the R interface you prefer)
To load the data we will be using (a .csv file), run the following command (e.g. in RStudio). To do so, either type the command into your script and run it from there or paste it directly into RStudio’s Console and press enter.
# stringsAsFactors = FALSE keeps text columns as character strings instead of converting them to factors
Brexit_data <- read.csv("http://andy.egge.rs/data/brexit/brexit_data.csv", stringsAsFactors = FALSE)
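To see what the `stringsAsFactors` option changes, here is a small sketch that reads the same CSV text twice. The area names and numbers are invented for illustration and are not from the real file; only the column names (Area, Region, Percent_Remain) are borrowed from the data set.

```r
# A tiny, made-up CSV (hypothetical areas and values, for illustration only)
csv_text <- "Area,Region,Percent_Remain
Exampleton,North,48.2
Sampleford,South,55.1"

# Keep text columns as plain character strings
as_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert text columns to factors (R's type for categorical variables)
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_text$Area)    # "character"
class(as_factor$Area)  # "factor"
```

Since R 4.0.0 the default is already `stringsAsFactors = FALSE`, so spelling it out mainly matters on older installations.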
Now we have loaded our data and named it “Brexit_data”. To view the data, click on it in the Environment or run the following command:
View(Brexit_data) # opens the data viewer
In the data viewer, you can click on the column names to sort the data.
3 Working with the EU referendum data
The main assessment of voting data after the UK’s EU Referendum came in the form of scatterplots and associated correlations. Correlation is a measure of how closely two variables move together: it tells us whether they follow a similar pattern (or not). We can visually assess correlation using a basic scatterplot that has one variable on the horizontal axis and one on the vertical axis. If there is no discernible pattern, the variables are weakly correlated or not correlated at all.
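Alongside the visual check, correlation can be summarised as a single number between -1 and 1 with base R's `cor()`. The sketch below uses made-up numbers (not referendum data) purely to show what a strong positive, a strong negative, and an absent linear pattern look like numerically.

```r
# Made-up values for illustration only
x          <- c(1, 2, 3, 4, 5)
y_together <- c(2, 4, 6, 8, 10)   # rises in step with x
y_opposite <- c(10, 8, 6, 4, 2)   # falls as x rises
y_none     <- c(3, 1, 4, 1, 3)    # no linear pattern with x

cor(x, y_together)  # perfect positive correlation
cor(x, y_opposite)  # perfect negative correlation
cor(x, y_none)      # essentially zero
```

With the real data, something like `cor(Brexit_data$age_mean, Brexit_data$Percent_Remain, use = "complete.obs")` should give the equivalent number for the first plot below, assuming both columns loaded as numeric.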
So let’s begin our analysis by making some scatterplots! The plots below are similar to those you may have seen after the referendum.
3.1 Age and Voting Remain
We will start with the relationship between age and voting patterns. In order to analyze for ourselves the relationship between age and EU referendum voting, we can make a plot of the percent of voters who voted “remain” against the mean age in each area. To do so run the following commands:
library(ggplot2) # loads the plotting library we need

# Make a plot using data from Brexit_data and name it 'remain_age_basic'.
# Note the labels won't show up until the interactive plot in the next command.
remain_age_basic <- ggplot(Brexit_data,
                           aes(x = age_mean, y = Percent_Remain, label = Area)) + # pick x and y variables & labels
  theme_bw() +                              # Remove gray background (i.e. make black and white)
  geom_point(shape = 1) +                   # Use hollow circles for dots with shape=1
  scale_x_continuous(limits = c(30, 50)) +  # sets x-axis scale
  scale_y_continuous(limits = c(20, 80)) +  # sets y-axis scale
  geom_smooth(method = lm,                  # Add line of best fit
              se = FALSE) +                 # Don't add shaded confidence region
  xlab("Age (average)") +                   # Label x axis
  ylab("Remain votes (%)")                  # Label y axis

remain_age_basic # View the plot
This creates a custom scatterplot of our two variables. When we create scatterplots to assess correlation, it is important that we ask ourselves why the variables we look at might be related to each other.
In this case, why might mean age have a relationship with voting habits in the EU referendum? What other explanations might do better?
What can this plot tell us about the relationship between the two variables? Hint: What do the axes represent?
What should we be aware of when using this data?
While this plot shows us some of the story, it does not allow us to know which points represent each area. For example, there is a point near x=42 and y=70 that seems to not quite follow the same pattern as the rest of the points — but we cannot tell which area it represents. If we were to label the points here though, the labels would overlap.
Sometimes a better way to assess a relationship is through interactive plots, which are becoming increasingly common in digital media. Interactive plots can include labels without overlap, as they only appear when you hover over the points. We will try that out in the next plot.
Optional: You can also try running this plot using the “Percent_Leave” data instead. To do so, you can copy the command above but change the relevant variable. Name this new plot leave_age (Hint: remember to use the “<-” symbol to name your object).
3.1.1 Making interactive plots
As mentioned above, if we want to get more from our plots, we can make them interactive. For our scatterplots, this means that we can hover over parts of our plot to get more details or view only some of the data at a time (e.g. the data for a specific region).
Let’s use an interactive plot to check to see if there is any regional variation of interest.
library(plotly) # loads the library we need to make our plots interactive

remain_age_region <- ggplot(Brexit_data,
                            aes(x = age_mean, y = Percent_Remain, colour = Region)) +
  theme_bw() +                             # Remove gray background (i.e. make black & white)
  geom_point() +                           # Use filled in circles by not selecting a shape
  geom_smooth(method = lm,                 # Add line of best fit
              se = FALSE) +                # Don't add shaded confidence region
  scale_y_continuous(limits = c(20, 80))   # set y-axis range

remain_age_region <- ggplotly(remain_age_region) # Make the plot interactive
remain_age_region # View the plot
Hover your mouse over the different points on the plot. What do you notice?
You can also try zooming in and out and looking at the other options in the menu at the top of the plot.
Does this plot give you any more useful information than the previous ones?
Can this plot tell us anything about regional variation in the relationship between age and voting ‘remain’ in the referendum? (Hint: look at the slope of the lines, also try clicking on the coloured dots in the legend to include and remove certain data)
A common take-home point after the referendum was that younger people voted to remain in the EU. But from this plot we can see that the relationship between mean age and voting isn’t as simple as that. There is regional variation: in the West Midlands, the relationship goes the opposite way!
We can also see from this plot that the point that was not part of the cluster (near x=42 and y=70) was from London. In fact, when looking only at London, the slope of the line was the steepest of all regions; yet the points are very dispersed. This suggests that the relationship between age and voting habits is not entirely clear. We should also keep in mind that the mean age in the different areas of London was lower than that in most other regions.
It is also worth pointing out that no data was available for Scotland or Northern Ireland for this plot; those areas may or may not have followed different patterns.
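The regional slopes read off the plot can also be computed directly. This is a sketch rather than the method used for the figures above: it fits a separate straight line per region with base R's `lm()`. The data frame `toy` and its values are made up so that the block runs on its own; swapping in `Brexit_data` should work if its columns are named as in the plotting code.

```r
# Made-up data: two hypothetical regions with opposite age patterns
toy <- data.frame(
  Region         = rep(c("Region A", "Region B"), each = 3),
  age_mean       = c(38, 40, 42, 38, 40, 42),
  Percent_Remain = c(60, 50, 40, 40, 44, 48)
)

# Fit Percent_Remain ~ age_mean separately within each region
# and keep only the slope (the coefficient on age_mean)
slopes <- sapply(split(toy, toy$Region), function(d) {
  coef(lm(Percent_Remain ~ age_mean, data = d))[["age_mean"]]
})

slopes  # negative for Region A, positive for Region B
```

A negative slope means that areas with older residents tended to have a lower remain share within that region; a positive slope means the opposite.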
3.2 Earnings and Voting Remain
Let’s now look at the relationship between earnings and voting patterns in the EU Referendum by region.
remain_earnings_region <- ggplot(Brexit_data,
                                 aes(x = Earnings_Mean, y = Percent_Remain, colour = Region)) +
  theme_bw() +                             # Remove gray background (i.e. make black & white)
  geom_point() +                           # Use filled in circles by not selecting a shape
  geom_smooth(method = lm,                 # Add line of best fit
              se = FALSE) +                # Don't add shaded confidence region
  scale_y_continuous(limits = c(20, 80))   # set y-axis range

remain_earnings_region <- ggplotly(remain_earnings_region) # Make the plot interactive
remain_earnings_region # View the plot
From this figure we can see that there is some regional variation in the relationship between average earnings and voting patterns (e.g. compare Yorkshire and Wales). Scotland and London had a larger propensity to vote remain overall, but some other regions (e.g. Yorkshire and the Humber) had a steeper slope (the same increase in earnings is associated with a larger increase in the remain vote). Importantly, the relationship is positive in all of these regions.
3.3 Education and Voting Remain
Lastly, let’s look at the relationship between education and EU Referendum voting patterns, broken down by region.
remain_education_region <- ggplot(Brexit_data,
                                  aes(x = Bachelors_deg_percent, y = Percent_Remain, colour = Region)) +
  theme_bw() +                             # Remove gray background (i.e. make black & white)
  geom_point() +                           # Use filled in circles by not selecting a shape
  geom_smooth(method = lm,                 # Add line of best fit
              se = FALSE) +                # Don't add shaded confidence region
  scale_y_continuous(limits = c(20, 90))   # set y-axis range

remain_education_region <- ggplotly(remain_education_region) # Make the plot interactive
remain_education_region # View the plot
Of the three relationships, the one between the percentage of residents with a bachelor’s degree and voting remain shows the least regional variation (notice how the points are largely clustered together). However, we do see some differences. In Scotland, for example, education is less strongly associated with voting patterns than in London (compare the slopes of the lines).
We can also see that in London, there are generally more people with a bachelor’s degree than elsewhere in the country (and as we saw from the previous figures, London also has on average younger residents who earn more than residents of other areas). To better understand which explanation is the most useful, we would need to go beyond using simple scatterplots to visualize correlation. While scatterplots give us part of the story, we cannot use them to statistically “control” for other explanations (i.e. we can’t isolate the relationship between age and voting outcomes that is in addition to, and separate from, any relationship between earnings or education and voting outcomes). In fact, age and education, age and earnings, and education and earnings are all likely to be correlated with each other (although their area-level mean values, which we use for our analysis, may not be).
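One common way to “control” for several explanations at once is multiple linear regression. The sketch below is an assumption about how one might proceed, not an analysis from this post: it uses simulated data (the ranges and effects are invented) so that it runs without the real file, and only the column names are taken from the data set.

```r
set.seed(42)  # make the simulated data reproducible
n <- 100

# Simulated areas; all ranges and effects below are invented
toy <- data.frame(
  age_mean              = runif(n, 32, 48),
  Earnings_Mean         = runif(n, 400, 900),
  Bachelors_deg_percent = runif(n, 10, 50)
)
toy$Percent_Remain <- 40 -
  0.3  * toy$age_mean +
  0.01 * toy$Earnings_Mean +
  0.6  * toy$Bachelors_deg_percent +
  rnorm(n, sd = 3)

# How correlated are the explanations with each other?
round(cor(toy[, c("age_mean", "Earnings_Mean", "Bachelors_deg_percent")]), 2)

# Each coefficient is the association with Percent_Remain
# holding the other two variables constant
fit <- lm(Percent_Remain ~ age_mean + Earnings_Mean + Bachelors_deg_percent,
          data = toy)
summary(fit)
```

With the real data, replacing `toy` with `Brexit_data` in the `lm()` call would be the natural next step. Keep in mind that results like these describe areas, not individual voters.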
While it can be useful to look at how demographic data relates to voting outcomes in the EU Referendum or other elections, it is important to dig deeper and ask why these patterns exist and if anything important isn’t being analysed. Why for example might earnings be a good predictor of voting? Could data on earnings be capturing other explanations we cannot easily measure (e.g. the number of times a person has visited other EU countries on holiday)?
As data and data analysis increasingly surround us in our everyday lives, it is important to become familiar with its uses and its necessary oversights. To return to the title, generally the numbers we are presented with in the media are not fake, but we do still need to take the time to decide how much we can conclude from any given data.
5 Variable Cheatsheet
CENSUS DATA 2011/ANNUAL SURVEY OF HOURS AND EARNINGS 2016
- Earnings_Median: Median value of gross pay for full time workers, before tax, National Insurance or other deductions
- Earnings_Mean: Mean value of gross pay for full time workers, before tax, National Insurance or other deductions
- Bachelors_deg_percent: Percent of all usual residents aged 16 and over with a bachelor’s degree
- age_mean: Mean age of usual residents
- age_median: Median age of usual residents
While I have done my best to avoid mistakes in the data, some may remain, arising either from user error or from errors in the underlying data.