
Coding for Visualisations: Regression Analyses and Choropleth Maps

The coding steps for the visualisations are separated below, though there is some overlap. Following data cleaning and translation we were able to start thinking about visualisations, but this involved further cleaning. Listwise deletion was used twice to resolve problems encountered in the translation phase: first in Python, so that the data could be imported into QGIS (removing Counties and keeping only Local Authorities), and second in QGIS itself, to remove any Local Authorities that were not present in both the pollution and the happiness data.
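To illustrate the first of these deletions, a minimal sketch in pandas, where the filename, the 'Geography type' column and the Local Authority label are all hypothetical:

import pandas as pd

# Load the combined dataset (file and column names are assumptions)
df = pd.read_csv("happiness_by_area.csv")

# Keep only Local Authority rows, dropping County-level aggregates
df = df[df["Geography type"] == "Local Authority"]

# Listwise deletion: drop any row with a missing value so the
# remaining rows can be joined cleanly once imported into QGIS
df = df.dropna()

df.to_csv("happiness_local_authorities.csv", index=False)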

Choropleth Maps

Step 1: Separating the data by local authority for the map visualisation and the North/South divide.

We had to label each row of data with its local authority to be able to show it on the choropleth map later on. This was also useful because it tackled one of our subquestions:

How does inequality present itself between the North and the South of England?

 

We researched the map and applied that knowledge to the data, labelling each row and header with its appropriate local authority name to create further subsets of our data, as sketched below.
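Continuing from the dataframe above, a minimal sketch of the labelling step; the region lists and the 'Area name' column are hypothetical:

# Hypothetical lists of local authority names for each half of England
north_names = ["Manchester", "Leeds", "Newcastle upon Tyne"]  # etc.

# Label every row, then split the data into North and South subsets
df["Region"] = df["Area name"].apply(
    lambda name: "North" if name in north_names else "South"
)
north_df = df[df["Region"] == "North"]
south_df = df[df["Region"] == "South"]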
 

north south separation.jpg

Step 2: Cleaning the North and South data and combining them

We used the same 'replace x' method on each of the separated happiness datasets and combined them into "England_happiness" with pandas' concat function.
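A sketch of this step, continuing from the subsets above and assuming 'x' marks suppressed values in the source tables:

import numpy as np
import pandas as pd

# Turn the 'x' placeholders used for suppressed values into
# proper missing values in each half
north_df = north_df.replace("x", np.nan)
south_df = south_df.replace("x", np.nan)

# Stack the cleaned halves into a single dataframe
England_happiness = pd.concat([north_df, south_df], ignore_index=True)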

 


cleaning north and south and combining them.jpg

We then collated all the PM10 pollution data from DEFRA and repeated the same process for the PM2.5 data.
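A minimal sketch of the collation, where the file naming pattern is an assumption:

import glob
import pandas as pd

# Collate the yearly DEFRA PM10 files into one dataframe
pm10_files = sorted(glob.glob("defra_pm10_*.csv"))
pm10 = pd.concat((pd.read_csv(f) for f in pm10_files), ignore_index=True)

# The same process, repeated for the PM2.5 files
pm25_files = sorted(glob.glob("defra_pm25_*.csv"))
pm25 = pd.concat((pd.read_csv(f) for f in pm25_files), ignore_index=True)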

 

pm10 clean and collate.jpg

We added all of the data to QGIS, summarising every attribute by location and deleting any data that was not available in all datasets, so that we could run regressions and create graduated maps (choropleth maps) for the joined data. All data used for further visual analysis were CSV files exported from the data aggregated by location: 300 locations, each with mean values for happiness, PM10 and PM2.5. Summary statistics for these datasets are available by clicking the button below.
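The deletion itself was done inside QGIS, but the equivalent step in pandas is an inner merge, which keeps only the locations present in every dataset; the filenames and the join column here are hypothetical:

import pandas as pd

# Hypothetical per-dataset CSVs of means by local authority
happiness = pd.read_csv("happiness_by_la.csv")
pm10 = pd.read_csv("pm10_by_la.csv")
pm25 = pd.read_csv("pm25_by_la.csv")

# Inner joins keep only the local authorities present in all three
# datasets, mirroring the deletion step performed in QGIS
joined = (
    happiness.merge(pm10, on="Local authority", how="inner")
             .merge(pm25, on="Local authority", how="inner")
)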


Now that there was only data for England, we were able to produce visualisations using a graduated symbology, for which we used QGIS's automatic classification of the data. This was repeated for each year, producing visually comparable maps of the happiness and pollution data from 2011 to 2021.
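We did this through the QGIS interface rather than in code, but for reference a roughly equivalent graduated (choropleth) map can be scripted with geopandas; this sketch assumes a boundary shapefile, the "joined" table from the merge above and hypothetical column names:

import geopandas as gpd

# Local authority boundaries joined to the aggregated means
gdf = gpd.read_file("local_authorities.shp").merge(
    joined, left_on="lad_name", right_on="Local authority"
)

# Graduated classification, similar to QGIS's automatic classes
# (the 'scheme' argument requires the mapclassify package)
ax = gdf.plot(column="Mean happiness", scheme="Quantiles", k=5,
              cmap="viridis", legend=True, figsize=(8, 10))
ax.set_axis_off()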

Regression Plots: initial plot

First we imported the relevant libraries.

After the dataframe was created with our data, we used scikit-learn to perform the regression, adapting Python code from scikit-learn's documentation and this Stack Overflow question to our needs. We then followed these instructions from Plotly to customise the plot.
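The original code was shown as a screenshot; a minimal reconstruction of the idea, with hypothetical filenames and column names, might look like this:

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression

df = pd.read_csv("joined_means.csv")      # hypothetical filename
X = df[["Mean PM10"]].values              # predictor (assumed column name)
y = df["Mean happiness"].values           # response (assumed column name)

# Fit the linear regression
model = LinearRegression().fit(X, y)

# Evenly spaced x values for drawing the fitted line
x_line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)

fig = go.Figure()
fig.add_trace(go.Scatter(x=X.ravel(), y=y, mode="markers",
                         name="Local authorities"))
fig.add_trace(go.Scatter(x=x_line.ravel(), y=model.predict(x_line),
                         mode="lines", name="Fitted line"))
fig.update_layout(xaxis_title="Mean PM10",
                  yaxis_title="Mean happiness")
fig.show()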

Finally, following instructions found on this website, we added code to display the regression statistics when the program runs. Exactly the same process was applied to all regression plots.
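For the statistics themselves, scikit-learn exposes the slope, intercept and R² directly, so a sketch of the reporting step, continuing from the fit above, might be:

from sklearn.metrics import r2_score

# Report slope, intercept and R squared for the fitted model
print(f"Slope:     {model.coef_[0]:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
print(f"R squared: {r2_score(y, model.predict(X)):.3f}")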

Regression Plots: facet plots

Again, the relevant libraries were added.

Using instructions for scatter graphs and facet plots from Plotly's graphics library, we created the following code.
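Again the code itself was shown as a screenshot; a minimal reconstruction with plotly.express, assuming a long-format table with one row per local authority per year and hypothetical column names:

import pandas as pd
import plotly.express as px

df = pd.read_csv("joined_means_by_year.csv")   # hypothetical filename

# One panel per year; the trendline option requires statsmodels
fig = px.scatter(df, x="Mean PM10", y="Mean happiness",
                 facet_col="Year", facet_col_wrap=4,
                 trendline="ols")
fig.show()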

When this code was run, we were able to produce an interactive facet plot. However, some tidying was needed to make the graph legible: for example, the x axis title was unnecessarily repeated under each subplot, making it difficult to read. We therefore applied code from this Stack Overflow thread, which allowed us to tidy the x axis.
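A sketch of that tidying step, continuing from the facet plot above:

# Clear the x axis title repeated under every panel...
fig.for_each_xaxis(lambda axis: axis.update(title=None))

# ...and add one shared title below the whole figure instead
fig.add_annotation(text="Mean PM10", xref="paper", yref="paper",
                   x=0.5, y=-0.1, showarrow=False)
fig.show()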

Correlation Matrix

For the correlation matrix we used this resource to create the graph. 
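A minimal sketch of a correlation matrix plotted with Plotly, with hypothetical filenames and column names:

import pandas as pd
import plotly.express as px

df = pd.read_csv("joined_means.csv")      # hypothetical filename

# Pairwise Pearson correlations between the numeric columns
corr = df[["Mean happiness", "Mean PM10", "Mean PM2.5"]].corr()

# Heatmap with the coefficients printed in each cell
# (text_auto requires plotly 5.5 or later)
fig = px.imshow(corr, text_auto=".2f",
                color_continuous_scale="RdBu_r", zmin=-1, zmax=1)
fig.show()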

Corr matrix.jpg