## The (false) problem of many zeros in the data

## To develop research and analyses, we need data.

Data is the “holy grail” of data scientists and, in the era of the Internet of Things, we have never had as much of it as we do today. Unfortunately, data doesn’t always arrive in a tidy and usable form, so we need to do some data cleansing and feature engineering to get it into a digestible format. Perhaps even more important, however, is understanding the data.

In this blog, UBDC Data Scientist Luis Serra presents an example of data that was not initially understood and explains how he worked to understand it and overcome his challenge.

### The challenge

In one project supported by UBDC’s Data Science Team, it was necessary to run an Exploratory Factor Analysis (EFA) to identify latent variables that could not be observed directly. For this analysis to run smoothly, and to avoid extracting ‘artefactual factors’ (factors that load heavily on a single variable), it was recommended to keep absolute skewness and kurtosis below 2.0 (Bandalos and Finney 2010). Here comes the first challenge: the data was highly skewed, and even after applying the traditional range of transformations (square root, cube root and log), the researchers were unable to bring the variables close to a normal distribution, or to bring all values of skewness and kurtosis within or very close to the desired limits. This is the point where my support was requested:

*“I can’t run my statistical analysis (EFA) because many of my variables are highly skewed!”*
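To see why the usual transformations struggle here, the sketch below computes skewness and kurtosis for a made-up count variable in which 95% of the polygons record zero. Everything in it is an assumption for illustration: the counts are invented, the helper names are hypothetical, kurtosis is taken in its excess form (zero for a normal distribution), and `log(1 + x)` stands in for the log transformation because the plain logarithm is undefined at zero.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


# Hypothetical facility counts for 1,000 polygons: 950 of them have none.
counts = [0] * 950 + [1] * 30 + [2] * 15 + [5] * 5

transforms = {
    "raw": lambda v: v,
    "square root": math.sqrt,
    "cube root": lambda v: v ** (1 / 3),
    "log(1 + x)": math.log1p,  # assumption: plain log cannot handle zeros
}

results = {}
for name, transform in transforms.items():
    transformed = [transform(v) for v in counts]
    results[name] = (skewness(transformed), excess_kurtosis(transformed))
    skew, kurt = results[name]
    print(f"{name:12s} skewness={skew:6.2f}  excess kurtosis={kurt:7.2f}")
```

Each of these transformations maps zero to zero and preserves order, so the large spike at zero survives all of them and the variable stays heavily skewed.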

### Understanding the data

The data for this project consisted of 28 socio-economic variables drawn from a census and 28 spatial variables drawn from a wide range of sources. The skewness problem arose in the spatial variables. Before diving into the problem, it is important to briefly explain the spatial data to better understand the challenge at hand. This data corresponded to different types of infrastructure, ranging from train stations and bus stops to libraries and schools, which were spatially joined to Iris polygons. Iris polygons are aggregated statistical units in France, each with roughly 2000 residents. These polygons, which cover the whole of the French territory, are comparable to the Scottish data zones. As you may already suspect, there was less infrastructure in rural areas than in the cities, and most polygons covered rural areas. For instance, the distribution of libraries by Iris polygon (Figure 1) shows that most Iris polygons had no library at all.

Figure 1: Distribution of libraries by Iris polygons in France.

After careful thought, I came up with the idea of interpolation. Interpolation estimates the value of a variable at an arbitrary position from its values at known positions. In essence, I was proposing to recalculate the value of each variable for every polygon based on the values of the neighbouring polygons. There are many (deterministic) interpolation methods, and the one that I thought at the time would best suit this approach was inverse distance weighted (IDW) interpolation, with the formula:

$$\hat{Z}_{i} = \frac{\sum_{j} Z_{j} / d_{ij}^{n}}{\sum_{j} 1 / d_{ij}^{n}}$$

Where:

Ẑ_{i} is the estimated value at polygon i

Z_{j} is the value at polygon j

d_{ij} is the distance between i and j

n is the power of the distance

This method is a weighted mean in which each weight is the inverse of the distance between polygons (or between their centroids, to be more precise), raised to the power n. The idea is that nearby polygons are weighted more heavily than polygons further away. The value of n can be adjusted to reflect the degree of influence of close polygons: a large n gives nearby polygons a much greater influence on the unsampled polygon than polygons further away. For instance, one could argue that train stations in neighbouring locations are far more important than train stations located some distance away.
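As a minimal sketch of this weighted mean, the function below estimates a value at a target centroid from three neighbours and shows how the estimate moves towards the nearest neighbour as n grows. The coordinates, the counts and the name `idw_estimate` are made up for illustration; this is not the code used in the project.

```python
import math


def idw_estimate(target, known_points, power=2.0):
    """Inverse distance weighted estimate at `target`.

    `known_points` is a list of ((x, y), value) pairs with known values.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in known_points:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0:
            return float(value)  # the target coincides with a known point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical neighbouring centroids and their library counts.
neighbours = [((1.0, 0.0), 2), ((0.0, 4.0), 0), ((10.0, 0.0), 5)]

for n in (1, 2, 4):
    estimate = idw_estimate((0.0, 0.0), neighbours, power=n)
    print(f"n={n}: estimate={estimate:.3f}")
```

With these numbers, the nearest neighbour holds two libraries, and the estimate gets closer to 2 as n increases.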

I was pretty sure of my approach to the problem. I had already tested this method and others to measure the positional accuracy of surface digital models since I have a background in Mapping and GIS.

What do you think? This approach seems nice, right? Well, not exactly…

### The “shaking” of my belief

I asked the opinion of Dr David McArthur, a researcher at UBDC, and his comment was:

*"Firstly, I would suggest not motivating the work as dealing with a problem of excess zeros. I would suggest that what you are interested in is access to facilities and that counting a polygon as having access only if it has a facility present in that particular polygon is a bad measure of accessibility."*

Well, I had to agree with David, in the sense that he framed the problem with the right words. The problem at hand was not a problem of excess zeros but rather a problem of accessibility to facilities or infrastructure. This comment made me realise that we (the researchers and I) were looking at the problem from the wrong perspective. We wanted the variables to follow a Gaussian distribution so that we could carry out our analysis, rather than trying to understand why the data recorded so many zeros and how we should handle this. For example, if the variable “libraries” recorded no zeros but instead a value of “1” or more for each Iris polygon, this would mean that France had at least one library per roughly 2000 inhabitants, which would probably be the highest rate of libraries per capita in the world!

### A better approach

To measure accessibility, David proposed using a gravity-based measure, with the following formula:

$$A_{i} = \sum_{j} \frac{F_{j}}{d_{ij}^{\sigma}}$$

Where:

A_{i} is the accessibility index of polygon i

F_{j} is the number of facilities in polygon j

d_{ij} is the distance between i and j

σ is a parameter measuring the strength of the distance deterrence effect

The distance to be measured is the Euclidean distance between the centroids of the polygons.
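To get a feel for the distance deterrence parameter, the sketch below computes how much a single facility contributes to a polygon's score at different distances and for different values of σ, assuming each facility contributes its count divided by the distance raised to the power σ. The distances, the units (kilometres) and the function name `contribution` are assumptions for illustration only.

```python
def contribution(facilities, distance, sigma):
    """Contribution of polygon j to the score of polygon i: F_j / d_ij ** sigma."""
    return facilities / distance ** sigma


# Hypothetical centroid-to-centroid distances, assumed to be in kilometres.
distances_km = [1.0, 2.0, 10.0]

for sigma in (0.5, 1.0, 2.0):
    per_facility = [contribution(1, d, sigma) for d in distances_km]
    total = sum(per_facility)  # score of a polygon with one facility at each distance
    formatted = ", ".join(f"{value:.3f}" for value in per_facility)
    print(f"sigma={sigma}: contributions=[{formatted}] total={total:.3f}")
```

The larger σ is, the faster the contribution of distant facilities fades, so the score is increasingly dominated by what is close by.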

Furthermore, David criticised the use of an interpolation method because *"interpolation is intended to be used for cases where a surface of values exists, but you have only sampled some points. The job is therefore to estimate the entire surface from the sampled points"*. Well, this is true, although one can also use interpolation to estimate a single point rather than the entire surface (even if, conceptually, the surface is there).

I wasn’t yet convinced of David’s suggestion to use a gravity-based model, which I hadn’t heard of before, but then I started to reconsider the suitability of my approach:

*Does it make sense to interpolate the polygons with zero values and reach a value such as 0.0007 libraries for a particular polygon, for instance?*

*Isn’t it enough to just order all polygons according to their proximity to facilities?*

In fact, yes: for the objective of this analysis, which was to feed an EFA, it was enough to order the polygons according to their distance to facilities. In other words, it was enough to create an index of accessibility. The actual values assigned to the polygons did not matter as long as the polygons were ordered. In line with this assumption, I could discard the division by the sum of the weights in the interpolation formula, and in that case the interpolation formula (with n playing the role of σ) equalled the gravity-based formula proposed by David!
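The link between the two formulas can be checked numerically. The sketch below, which uses made-up values and hypothetical function names, shows that when n equals σ the gravity-based score is the IDW estimate multiplied by the sum of the weights, i.e. the IDW formula without its denominator.

```python
def weights(distances, power):
    """Inverse distance weights: 1 / d ** power."""
    return [1.0 / d ** power for d in distances]


def idw(values, distances, power):
    """IDW estimate: weighted mean of the values."""
    w = weights(distances, power)
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


def gravity(values, distances, sigma):
    """Gravity-based score: sum of F_j / d_ij ** sigma."""
    return sum(v / d ** sigma for v, d in zip(values, distances))


# Hypothetical facility counts and distances seen from one polygon.
values = [0, 1, 3]
distances = [0.5, 2.0, 8.0]
power = 1.0

gravity_score = gravity(values, distances, sigma=power)
idw_numerator = idw(values, distances, power) * sum(weights(distances, power))

print(gravity_score, idw_numerator)
```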

I ended up with the index of accessibility of Iris polygons using the gravity-based measure proposed by David.

In the absence of data to estimate the sigma parameter, I assumed a value of one. Furthermore, to avoid dividing by zero, I added 0.1 to the distance, d_{ij}. Adding 0.1 to the distance is also motivated by the fact that, even within an Iris zone, there will likely be some travel involved when a person goes from their home to a facility. It is worth mentioning that other options are available to model intra-zonal trip distances; the one I used was simply the simplest that met the needs of the challenge.

The index sorted all Iris polygons according to their proximity to facilities and infrastructure. The polygons closer to facilities recorded a higher score, whereas the ones further away recorded a lower score. In the end, there were 28 scores, one for each facility/infrastructure variable.

The calculation was carried out using the centroids of the Iris polygons and the distances between them. Python with the pandas library was used to perform the calculations, alongside QGIS to determine the location of the Iris centroids.
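The original code is not shown here, so what follows is only a sketch of how such an index could be computed, under stated assumptions: it uses NumPy rather than pandas for brevity, the centroids and library counts are invented, the function name `accessibility_index` is hypothetical, and the 0.1 offset is taken to be in the same units as the centroid coordinates.

```python
import numpy as np


def accessibility_index(centroids, facilities, sigma=1.0, offset=0.1):
    """Gravity-based score per polygon: sum over j of F_j / (d_ij + offset) ** sigma."""
    centroids = np.asarray(centroids, dtype=float)    # shape (N, 2)
    facilities = np.asarray(facilities, dtype=float)  # shape (N,)

    # Pairwise Euclidean distances between centroids, shape (N, N).
    diff = centroids[:, None, :] - centroids[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))

    # The offset keeps the i == j term finite and stands in for travel
    # within a polygon.
    distances = distances + offset

    return (facilities[None, :] / distances ** sigma).sum(axis=1)


# Hypothetical centroids along a line and the number of libraries in each polygon.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (20.0, 0.0)]
libraries = [0, 2, 0, 0]

scores = accessibility_index(centroids, libraries)
ranking = np.argsort(-scores)  # polygon indices, most accessible first

print(scores)
print(ranking)
```

Polygons without a library of their own still receive a non-zero score that falls with distance from the nearest libraries. For a country-sized set of polygons the full N × N distance matrix may not fit in memory, in which case processing the polygons in chunks is one option.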

### Conclusion

This wee story aims to illustrate that convictions in science should never be allowed to dominate one’s thinking but also that we should question the approaches others suggest. Another takeaway is that people with different backgrounds can reach the same conclusions if they are willing to ask questions and follow a scientific approach. Lastly, this story also reflects some of the challenges that constantly face the Data Science Team at UBDC.

### Acknowledgements

I would like to acknowledge Hugo D'Assenza-David and Professor Simon Joss, the researchers who put this challenge to me; Dr David McArthur, who helped me to understand the challenge and finally, Dr Andrew McHugh who challenged me to write this blog. Furthermore, Dr David McArthur and Dr Andrew McHugh were of great help by reviewing the text.

As a Data Scientist, Luis is responsible for supporting UBDC and its stakeholders in a vast range of data analysis. He works with disparate sources of data to extract valuable information.