
Data Cleaning

The data cleaning and translation processes overlapped to create the choropleth maps, which in turn informed the regression and correlation analyses. Most of the cleaning was done in Python, with QGIS used at the end of the process to aggregate the data by Local Authority. This first round of cleaning allowed us to aggregate the data by Local Authority, separate it across the North-South divide, and use it to produce the correlation and regression.

Cleaning for Choropleth Map

Step 1: Importing libraries and reading all downloaded files into dataframes

25 October

The libraries imported were those covered in the workshops.

[Screenshot: the imported libraries]
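The screenshot above showed the actual imports and file reads. As an illustration only, a minimal sketch of this step is given below; the filenames are hypothetical stand-ins for the downloaded files, not the ones we actually used.

# Illustrative sketch: read the downloaded files into dataframes.
# The filenames here are hypothetical placeholders.
import pandas as pd

happiness_df = pd.read_csv("happiness.csv")   # annual happiness data
pollution_df = pd.read_csv("pollution.csv")   # annual PM2.5 data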

Step 2: Replacing missing values in happiness and pollution data 

27 October

The missing data labelled "x", "u" and "MISSING" were replaced with the integer zero in both the happiness and the pollution datasets, leaving a large number of placeholder zeros in each. The columns in both datasets were also renamed to make them more intuitive and clear. A sketch of this step follows the screenshots.


[Screenshot: a snippet of the code used]

[Screenshot: the missing-data labels in the raw data]
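As a rough illustration of this step (not our exact code, and with hypothetical column names), the label replacement and renaming could be sketched in pandas as follows.

# Illustrative sketch: replace the missing-data labels with the integer 0
# and rename columns; old and new column names are hypothetical examples.
missing_labels = ["x", "u", "MISSING"]

happiness_df = happiness_df.replace(missing_labels, 0)
pollution_df = pollution_df.replace(missing_labels, 0)

happiness_df = happiness_df.rename(columns={"F2017-18": "Happiness 2017"})
pollution_df = pollution_df.rename(columns={"pm25_2017": "PM2.5 2017"})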

The missing data took considerable work because we had to decide whether to keep only the rows that were complete for all years (listwise deletion) or to implement an imputation method such as k-nearest neighbours (kNN) or multiple imputation by chained equations (MICE), which would have required machine learning and considerably more time but would have given more accurate and reliable results.

 

We chose listwise deletion for the time being and continued cleaning the data. The screenshots below show only an extract of the full code, illustrating how listwise deletion was run: any row with a 0 in the named columns was deleted. A sketch of this step follows the screenshots.

[Screenshot: code implementing listwise deletion]

[Screenshot: data before cleaning]
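A minimal sketch of the listwise deletion step, assuming hypothetical column names, is given below; the screenshots contain the code we actually ran.

# Illustrative sketch of listwise deletion: drop any row with a 0
# in the named columns (column names are hypothetical examples).
columns_to_check = ["Happiness 2017", "Happiness 2018", "Happiness 2019"]
happiness_df = happiness_df[(happiness_df[columns_to_check] != 0).all(axis=1)]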

Justification of Listwise Deletion:

 

One of the biggest challenges in cleaning the data was deciding how to approach the missing data. We considered imputation, the 'filling in of the missing values' (Rässler et al., 2012), as this method is flexible and can be applied straightforwardly to linear regression.

 

If all years for a row have missing values (i.e. a row full of 0s), then any given year has an equal probability of missing data (Allison, 2002). As Allison notes, "Specifically, if the data are missing completely at random (MCAR), then the reduced sample will be a random subsample of the original sample" (ibid.). In this instance, listwise deletion could be applied to our project, since all the values in these rows were missing (see above).

 

For isolated missing values, there was enough complementary data that every local authority still retained its own, unique mean pollution value; each local authority still had more than 26 unique data points, and some had over 200.


Plaia & Bondi (2006) found that the best way to deal with missing pollution data was MICE (multiple imputation by chained equations), because it accounts for the uncertainty within each imputation. However, we would have needed much more data to train a model capable of giving accurate estimates of the missing values, and this was outside the scope of our study.

 

According to Alahamade et al., the next best approach would be k-nearest neighbour imputation. However, listwise deletion was satisfactory for the pollution data, and applying nearest-neighbour imputation to the happiness dataset, which is already reported as means covering larger areas categorised by local authority, would have yielded unreliable results; it was also outside the scope of our study. We therefore chose the slightly biased but still satisfactory solution.

Regression Cleaning

Data cleaning for regression plots

The regression required the data to be in two columns, which meant the data had to be cleaned again. The first image shows a sample of the PM2.5 data and the happiness data for each year across England. The mean of each row (representing the mean across all years) was taken for both datasets. From this, a separate .csv file containing two columns of data could be created for the Python regression (see picture 2).

Two main issues emerged at this stage. Firstly, there were far more air pollution observations than happiness observations, approximately 250,000 versus 300. This meant that when the data frame combining them was made, it contained many NaN values. Imputation only repeated the same value for each NaN, so the decision was made to simply delete rows containing missing data. To see the source of the code we used to attempt data imputation, click here; for our source of more general code and information about missing values, click here.

The second issue was that location data were unavailable for the PM2.5 measurements, so there was no way to match up the happiness data and the air pollution data by location. Simply deleting all the rows with missing data could therefore introduce bias if, for example, the first 300 rows of air pollution data were all measured within a single local authority. To counter this and attempt to control for location, we included a line in the regression code that randomly shuffled the column containing the air pollution data in the data frame. The code for this was adapted from the code found here. This is not perfect, but it provided an achievable way around an issue that could have significantly skewed our results. For an extract of our code that cleaned the data, see picture 3; a sketch of this step follows the pictures.

Picture 1: data pre-cleaning
Picture 2: data after cleaning
Picture 3: code used for cleaning
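The pictures above show the data and the code we actually used. As a rough sketch only, assuming hypothetical column names and filenames, the row means, two-column .csv, row deletion and column shuffle described above could look like this in pandas.

# Illustrative sketch of the regression cleaning; all names are hypothetical.
import pandas as pd

# Mean across the yearly columns of each dataset
pollution_means = pollution_df.filter(like="PM2.5").mean(axis=1)
happiness_means = happiness_df.filter(like="Happiness").mean(axis=1)

# Combine into two columns; the much shorter happiness series leaves many NaNs
regression_df = pd.DataFrame({
    "mean_pm25": pollution_means.reset_index(drop=True),
    "mean_happiness": happiness_means.reset_index(drop=True),
})

# Delete rows containing missing data rather than imputing them
regression_df = regression_df.dropna().reset_index(drop=True)

# Randomly shuffle the air pollution column to control for location ordering
regression_df["mean_pm25"] = (
    regression_df["mean_pm25"].sample(frac=1, random_state=0).reset_index(drop=True)
)

# Save the two-column data for the regression script
regression_df.to_csv("regression_data.csv", index=False)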

Data cleaning for the facet plot

Similarly, the happiness data required further cleaning before it could be used to make the faceted scatter plot. The images below show the data before and after cleaning; a sketch of a typical reshaping step follows them.

Before cleaning

After cleaning
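The exact cleaning steps are shown in the screenshots. One common transformation for facet plots, reshaping the wide yearly columns into long format, is sketched below; the column names are hypothetical, and this is an assumption about the kind of cleaning involved rather than our actual code.

# Illustrative sketch: reshape happiness data from wide (one column per year)
# to long format, as facet plots typically require; names are hypothetical.
happiness_long = happiness_df.melt(
    id_vars=["Local Authority"],
    value_vars=["Happiness 2017", "Happiness 2018", "Happiness 2019"],
    var_name="Year",
    value_name="Mean happiness",
)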