Skewed data: a problem for your statistical model
First of all: what is skewed data? We call data skewed when the curve of its statistical distribution appears distorted to the left or to the right. In a normal distribution, the graph is symmetrical, which means that the left half of the curve mirrors the right half around the center.
What is skewed data?
Data is skewed when its statistical distribution curve appears distorted to the left or to the right.
Let’s look at this height distribution graph as an example:
Here you can see that the green graph (males) is symmetric around 69, and the yellow graph (females) is symmetric around 64. This means that most of the males in this dataset have a height close to 69, and most of the females have a height close to 64. A few males have heights close to 75 or 63, and a few females have heights close to 68 or 58.
In a normal distribution, the mean, the median and the mode are close together. All three are measures of central tendency. We can determine the skewness of data by how these quantities relate to each other.
Right-skewed (or positively skewed) data
A right-skewed distribution has a long tail that extends toward the right (or positive) side of the x-axis, as you can see in the graphic below.
Here you can see the positions of the three measures on the plot. So you see:

The mean is greater than the mode.

The median is greater than the mode.

The mean is greater than the median.
While the mean and median are typically greater than the mode in a right-skewed distribution, the mean may not always be greater than the median.
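The ordering described above can be checked on a small sample. This is a minimal sketch using Python's standard `statistics` module; the numbers are made up for illustration and do not come from any figure in this article.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are large.
sample = [1, 2, 2, 2, 3, 3, 4, 5, 7, 11]

mode = statistics.mode(sample)      # most frequent value
median = statistics.median(sample)  # middle of the sorted values
mean = statistics.mean(sample)      # pulled toward the long right tail

print(f"mode={mode}, median={median}, mean={mean}")
print("mode < median < mean:", mode < median < mean)
```

The few large values drag the mean to the right, while the mode stays at the peak and the median lands in between.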
Let’s look at some real world examples.
This is right-skewed data, with its tail on the positive side of the distribution. Here, the distribution tells us that most people have incomes of around $20,000 per year and that the number of people with higher incomes drops off rapidly as we move to the right.
Now take a look at the following distribution from the 2002 General Social Survey. Respondents indicated how many people over the age of 18 lived in their household.
Here, the distribution is skewed to the right. Although the mean is usually to the right of the median in a right-skewed distribution, this is not the case here.
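To see how a right-skewed distribution can still have its mean to the left of its median, here is a small sketch with made-up household counts. These values are an assumption for illustration only, not the survey's data, and `skewness` is a hypothetical helper that computes the moment coefficient of skewness.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    third_moment = sum((v - mu) ** 3 for v in values) / len(values)
    return third_moment / sigma ** 3

# Hypothetical number of adults per household (NOT the survey data):
# nine households with 1 adult, ten with 2 adults, one with 6 adults.
adults = [1] * 9 + [2] * 10 + [6]

mean = statistics.fmean(adults)
median = statistics.median(adults)
skew = skewness(adults)

print(f"mean={mean:.2f}, median={median}, skewness={skew:.2f}")
```

With discrete data like this, many observations sit exactly at the median, so a large group just below it can pull the mean under the median even though the single large household keeps the skewness positive.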
Left-skewed (or negatively skewed) data
A left-skewed distribution has a long tail that extends toward the left (or negative) side of the x-axis, as you can see in the graphic below.
Here you can see the positions of the three measures on the plot. So you will find:

The mean is less than the mode.

The median is less than the mode.

The mean is less than the median.
While the mean and median are typically less than the mode in a left-skewed distribution, the mean may not always be less than the median.
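A left-skewed sample is the mirror image of a right-skewed one, so the order of the three measures flips: the mean falls below the median, which falls below the mode. The sketch below builds such a sample by reflecting made-up right-skewed values; the data and the reflection point of 12 are arbitrary choices for illustration.

```python
import statistics

# Hypothetical right-skewed values, reflected so the long tail points left.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 7, 11]
left_skewed = [12 - v for v in right_skewed]

mean = statistics.mean(left_skewed)      # pulled toward the long left tail
median = statistics.median(left_skewed)  # middle of the sorted values
mode = statistics.mode(left_skewed)      # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
print("mean < median < mode:", mean < median < mode)
```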
Let’s look at another real world example.
Here, the distribution tells us that the most common age at death is around 90 (the mode). The average life expectancy would be around 75 to 85 years (the mean). In the above distribution, you can see a small spike at the very beginning, which indicates that a small percentage of the population dies at birth or during infancy. This group acts as an outlier in our distribution.
My data is skewed. So what?
Real-world distributions are generally skewed, as we saw in the examples above. But if there is too much skewness in the data, many statistical models do not work effectively. Why is that?
In skewed data, the tail region can act as an outlier for the statistical model, and outliers are known to hurt model performance, particularly for regression-based models. Some models, such as tree-based models, are robust enough to handle outliers, but relying on them limits which other models you can try. So what do you do? You can transform the skewed data so that its distribution is closer to a Gaussian (or normal) distribution. Removing outliers and normalizing the data lets us experiment with more statistical models.
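As a rough illustration of how a single tail value can distort a regression-based model, the sketch below fits a least-squares line to points that lie exactly on y = 2x + 1, then adds one extreme point and fits again. The data and the `ols_slope` helper are hypothetical and written only for this example.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Ten points that lie exactly on the line y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
slope_clean = ols_slope(xs, ys)

# The same data plus one extreme point far out in the tail.
slope_tail = ols_slope(xs + [10], ys + [100])

print(f"slope without tail point: {slope_clean:.2f}")
print(f"slope with tail point:    {slope_tail:.2f}")
```

One point out of eleven is enough to more than double the fitted slope, because least squares penalizes large residuals quadratically.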
Log transformation
Log transformation is a data transformation method in which we apply a logarithmic function to the data. It replaces each value x with log(x). A log transformation can bring a heavily skewed distribution closer to a Gaussian distribution. After the log transformation, we can see patterns in our data much more easily. Here is an example:
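The effect of the transformation can be measured by comparing skewness before and after. The following is a minimal sketch on synthetic data: the values are log-normal draws generated for illustration (an assumption, not data from this article), and `skewness` is a hypothetical helper for the moment coefficient of skewness.

```python
import math
import random
import statistics

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    third_moment = sum((v - mu) ** 3 for v in values) / len(values)
    return third_moment / sigma ** 3

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic right-skewed data: log-normal draws are strictly positive.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Replace each value x with log(x).
logged = [math.log(x) for x in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```

Note that log(x) is only defined for x > 0. If the data contains zeros, a common workaround is log(1 + x), available as `math.log1p`; data with negative values needs a different transformation.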
In the figure above, you can clearly see the patterns after applying the log transformation. Before that, we had too many outliers, which would negatively affect the performance of our model.
Skewed data can distort our results. So, before modeling heavily skewed data, we can apply a log transformation to the values to uncover patterns in the data and let our statistical model learn from them.
This article was originally published on Towards Data Science.