Data Collection

A snippet of PM10 data in its individual sheet form (with the UK Gridcode, coordinates and gravimetric units

Initial Data Collection:

After settling on the ONS and DEFRA data, we downloaded these into a local file which we then performed the cleaning of the data on. The happiness data was all located in a singular sheet within an excel file which we downloaded in a csv format. The Pollution data had individual csv files for each year.

Our final goal was to have three data sets with comparable data for PM10, PM2.5 and Happiness means.

We used numpy and pandas for this.

A snippet of happiness data within the master sheet, separated by Local Authority (seen here under 'NORTH EAST' and 'NORTH WEST'

A snippet of happiness data, cleaned into two sheet: this csv file contains all happiness data for every Local Authority in the South of England

Happiness

The happiness data was categorised by local authorities and counties within each region of the country.

We uploaded the Happiness means csv files, having different dataframes for each region. This would allow us to split the data between North and South as we had the indicative regions for each from a shapefile downloaded on NHS Regions from the ONS OpenGeography Portal. The column names needed to be consistent throughout all dataframes to make the data more easily readable and so we could merge and concatenate data by column names.

The happiness was then concatenated into north and south. All missing data was replaced with the integer, “0”. A key difference in the data cleaning here, compared to the workshop was that regex was False as we did not want the missing values to be considered substrings and replace Local Authority Names with this value.

We used listwise deletion (the next stage, which we explain in more detail here). Therefore, if there was a missing value in any column within a given row, the whole row was deleted.

The clean datasets were redownloaded as csv files to be uploaded to a shared group folder.

PM10 & PM2.5

A snippet of the cleaned PM2.5 pollution data, aggregated by year, containing coordinates and UK Gridcode

The pollution data had many more data points than the Happiness data and multiple csv files which had to be combined into a singular data frame. It was also categorised by x and y coordinates within the British National Grid system and a singular identifiable UK Grid Code.

We saved all the files for PM10 and PM2.5 into folders with their names respectively which we used as a global file path to automate the reading of each file within pandas. These were all concatenated into a dataframe, providing two dataframes- one for PM10 and one for PM2.5.

Once concatenated, the data was grouped by UK Grid Code. These were cleaned and any blank cells or missing datum cells were given the value of 0. Listwise deletion was completed for these and clean datasets for each variable were saved as csv files to make accessible to the whole group.

Step 2: Data Cleaning