Normality Tests

Venkatesh Gandi
13 min read · Mar 31, 2020

In statistics, many statistical tests and procedures are based on specific distributional assumptions. The assumption of normality is particularly common in classical statistical tests. Normality tests are used to determine if a data set is well-modeled by a normal distribution and to compute how likely it is for a random variable underlying the data set to be normally distributed.

My journey into normality tests began when I asked myself the question “Why are there so many normality tests?”

Along the way, I became curious about the answers to these questions too.

  • Why can’t we decide about the normality of our sample by just checking the sample mean, median, mode, variance, skewness, and kurtosis?
  • What criteria should we look at while choosing a normality test?
  • What is the best test for normality?

You can find answers to the above questions here. I will show how to conduct different normality tests in Python and make notes on their advantages and limitations, and I will keep updating this post whenever I learn a new test. Let’s explore the methods to check normality. If you want to experiment with the code as you read, see this GitHub [link](https://github.com/ThisIsVenkatesh/Normality-Tests/blob/master/Normality%20Tests.ipynb).

1. Graphical Methods (Visual checks)

1a. Histogram, KDE

As we know, the normal distribution is symmetric and bell-shaped. A simple way to check normality is to look at the distribution of the sample data with a histogram. Let’s draw the histogram of a random sample in Python.

import numpy as np 
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import seaborn as sns

Imported necessary packages

from math import sqrt
n=300 # Sample size
data=np.random.uniform(low=1,high=10,size=n) # Sample
#data=np.random.normal(loc=0,scale=10,size=n)
bin_value=int(sqrt(len(data))) #choosing the number of bins to draw histogram
sns.distplot(data,bins=bin_value);
plt.xlabel("Sample Data",size=15);
plt.ylabel("Density",size=15);
plt.title("Histogram",size=20);
plt.show()
Histogram of the sample data
  • From the above histogram, we can clearly say that the sample is not normally distributed.
  • But based on a histogram alone, we can’t always judge whether the sampled data is normally distributed, even when the histogram suggests that it is. Click here to know why.
  • One can consider the KDE (the line drawn in the plot) as an alternative that reduces the arbitrariness of histograms when checking the normality of the given sample.
  • The KDE is also not preferred for checking normality when the sample size is small.
  • In practical situations, one cannot come to a conclusion about normality just by checking the KDE/histogram. Check this for more information.
  • Still, this is an important first step in choosing an appropriate normality test, as it describes how the sample data is distributed.

1b. QQ Plot

A Q–Q plot can be viewed as a non-parametric approach to comparing the underlying distributions of two samples of data, and it is generally a more powerful approach than comparing histograms of the two samples. Q–Q plots are also commonly used to compare a data set to a theoretical model by plotting their quantiles against each other. For a normality test, we compare our ordered sample data with the quantiles of the standard normal distribution. Thus, a Q–Q plot helps us identify substantive departures from normality. The resulting plot looks close to a straight line if the data are approximately normally distributed; deviations from a straight line suggest departures from normality. The main step in constructing a Q–Q plot is calculating or estimating the quantiles to be plotted. To see how the quantiles are generated in scipy.stats.probplot(), see this notebook.

Note: The normal probability plot is a special case of the Q–Q probability plot for a normal distribution.

Normal probability plot in python
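A minimal sketch of how such a plot can be drawn with scipy.stats.probplot, reusing the data array generated in the histogram section (titles and labels here are illustrative choices):

from scipy import stats
import matplotlib.pyplot as plt

# probplot computes the theoretical normal quantiles and the ordered sample
# values, and draws them against each other (plus a fitted line) when plot is given
stats.probplot(data, dist="norm", plot=plt)
plt.title("Normal probability plot", size=20)
plt.xlabel("Theoretical quantiles", size=15)
plt.ylabel("Ordered sample values", size=15)
plt.show()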
  • This method is not preferred if the sample size is small.
  • If the sample size is sufficiently large, most statistical significance tests will detect even trivial departures from the null hypothesis (i.e., there may be a statistically significant effect that is too small to be of any practical significance); thus, additional investigation of the effect size, such as a Q–Q plot, is typically advisable. So, if we have a sufficiently large sample, we can opt for this method instead of a statistical significance test. For more information check this link and this link.
  • Since it relies on visual inspection, it is trickier to draw conclusions when we don’t have enough samples.

2. Statistical tests to test the normality

Note: To understand the following concepts clearly, it helps to know some inferential-statistics terms such as test statistic, null distribution, power, p-value, and critical values. I haven’t provided the formulas of the test statistics here; to learn the formula for a specific test statistic, please refer to its Wikipedia page.

2a. Shapiro–Wilk test

The Shapiro–Wilk test tests the null hypothesis that a sample x1, …, xn came from a normally distributed population. It tests normality with unspecified μ and σ, i.e., the sample is from a normal distribution with unknown mean μ and unknown SD σ. The test exhibits high power (the probability that the test rejects the null hypothesis H0 when a specific alternative hypothesis H1 is true), leading to good results even with a small number of observations.

H0 : X ∼ N(μ, σ)

H1 : X ≁ N(μ, σ)

Test Statistic: The test statistic is obtained by dividing the square of an appropriate linear combination of the sample order statistics by the usual symmetric estimate of variance. Note that if one is indeed sampling from a normal population, then the numerator b² and the denominator S² of W (the test statistic) are both, up to a constant, estimating the same quantity, namely σ². Let’s see how to implement this in Python.

Shapiro-Wilk test in python
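A minimal sketch using scipy.stats.shapiro, reusing the same data array; the 0.05 significance level is just an illustrative choice:

from scipy import stats

alpha = 0.05                          # chosen significance level
stat, p_value = stats.shapiro(data)   # returns the W statistic and its p-value
print("W = %.4f, p-value = %.4f" % (stat, p_value))
if p_value > alpha:
    print("Fail to reject H0: the sample looks normally distributed")
else:
    print("Reject H0: the sample does not look normally distributed")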
  • Initially, this test was feasible only for small samples (n ≤ 50). Royston later proposed an alternative method of calculating the coefficients vector, providing an algorithm that extended the usable sample size to 2,000; this technique is used in several software packages. Rahman and Govindarajulu extended the sample size further, up to 5,000.
  • In Python, this test is not recommended if the sample size is above 5,000. For n > 5000 the test statistic W is accurate but the p-value may not be. (The accuracy of the p-value depends on how close the assumed distribution is to the true distribution of the test statistic under the null hypothesis.)
  • This test is not recommended if the data set has many identical values.
  • The Shapiro–Wilk test has the best power under a broad range of useful alternatives for a given significance level, compared with the Anderson–Darling, Kolmogorov–Smirnov, and Lilliefors tests.
  • It is recommended when we don’t have a particular alternative distribution in mind.

2b. Kolmogorov-Smirnov test and Lilliefors test

The Kolmogorov-Smirnov (K-S) test is a non-parametric test based on the empirical cumulative distribution function (ECDF).

There are two standard versions of the Kolmogorov-Smirnov test:

  1. The one-sample KS, which tests if a sample of points X1,…,Xn∈R fits a specific continuous distribution function F.
  2. The two-sample KS, which tests whether it is reasonable to assume that two sets of samples X1,…,Xn and Y1,…,Ym come from the same continuous distribution.

So, the Kolmogorov–Smirnov test can serve as a goodness-of-fit test (one-sample case). In the special case of testing for normality of the distribution:

H0 : X ∼ N(μ, σ) with specified μ, σ

H1 : X ≁ N(μ, σ)

In general, samples are standardized and compared with a standard normal distribution(i.e., with specified mean(0) and variance(1)).

Test statistic: The Kolmogorov–Smirnov statistic is the supremum (greatest) distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution (the standard normal distribution in our case for testing normality), i.e., D = sup_x |F_n(x) − F(x)|. If the sample comes from the reference distribution F(x), then the test statistic converges to 0 almost surely.

The null distribution of this statistic converges to the Kolmogorov distribution.

The goodness-of-fit test or the Kolmogorov–Smirnov test can be constructed by using the critical values of the Kolmogorov distribution. This test is asymptotically valid when n tends to infinity.

Standardized data compared with N(0,1) using K-S test
Obtaining the location of supremum distance
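The following sketch (again reusing the data array) standardizes the sample, runs the one-sample KS test against N(0,1) with scipy.stats.kstest, and then locates roughly where the supremum distance occurs; the helper variable names are illustrative:

import numpy as np
from scipy import stats

# Standardize the sample so that it can be compared with N(0, 1)
z = (data - np.mean(data)) / np.std(data, ddof=1)

# One-sample KS test against the standard normal distribution
stat, p_value = stats.kstest(z, "norm")
print("D = %.4f, p-value = %.4f" % (stat, p_value))

# Locate where the supremum distance between the ECDF and the normal CDF occurs
z_sorted = np.sort(z)
n = len(z_sorted)
cdf = stats.norm.cdf(z_sorted)                      # reference CDF at the sorted points
d_plus = np.abs(np.arange(1, n + 1) / n - cdf)      # distance just after each ECDF jump
d_minus = np.abs(cdf - np.arange(0, n) / n)         # distance just before each ECDF jump
idx = np.argmax(np.maximum(d_plus, d_minus))
print("Supremum distance is located near z = %.4f" % z_sorted[idx])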

In general, the population parameters will not be known. If either the form or the parameters of F(x) are determined from the data Xi themselves, then the critical values determined in this way are invalid. In such cases, Monte Carlo or other methods may be required, although tables have been prepared for some cases.

In that case, we need some modifications. Details for the required modifications to the test statistic and for the critical values for the normal distribution and the exponential distribution have been published. The Lilliefors test represents a special case of this for the normal distribution.

Lilliefors test:

The Lilliefors test is a normality test based on the Kolmogorov–Smirnov test. It is used to test the null hypothesis that data come from a normally distributed population, when the null hypothesis does not specify which normal distribution; i.e., it does not specify the expected value and variance of the distribution.

Instead of comparing the standardized data with the standard normal distribution, we compare the sample data with a normal distribution whose mean and variance are estimated from the sample. The test statistic is the same as the one used in the Kolmogorov–Smirnov test.

This is where the test becomes more complicated than the Kolmogorov–Smirnov test. Since the hypothesized CDF has been moved closer to the data by estimating its parameters from those data, the maximum discrepancy is smaller than it would have been if the null hypothesis had singled out just one normal distribution. Thus the “null distribution” of the test statistic, i.e., its probability distribution assuming the null hypothesis is true, is stochastically smaller than the Kolmogorov–Smirnov distribution. This is the Lilliefors distribution. Let’s see how to implement this test in Python.
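A minimal sketch using the lilliefors function from statsmodels (statsmodels.stats.diagnostic), again on the same data array:

from statsmodels.stats.diagnostic import lilliefors

# The mean and variance are estimated from the sample itself, so the
# Lilliefors null distribution is used instead of the Kolmogorov distribution
stat, p_value = lilliefors(data, dist="norm")
print("D = %.4f, p-value = %.4f" % (stat, p_value))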

  • For the KS test, the asymptotic power of the test is 1.
  • Perhaps its main limitation is that the reference distribution must be fully specified.
  • The KS test can be appropriate for testing whether the distribution of errors is normal with mean 0, in order to make decisions about the process.
  • It tends to be more sensitive near the center of the distribution than at the tails (that is why it is useful to know where the supremum distance is located). If there are repeated deviations between the EDFs, or the EDFs have (or are adjusted to have) the same mean values, then the EDFs cross each other multiple times and the maximum deviation between the distributions is reduced. The Cramér-von Mises (CvM) test, which measures the sum of squared deviations between the EDFs, handles this case well.

2c. Cramér–von Mises test and Anderson–Darling test

Cramér–von Mises test: Simply put, in the KS test only the maximum distance is used as the test statistic, whereas in the Cramér–von Mises test the statistic is based on all the deviations: it is the sum of the squared deviations. Empirical evidence suggests that the Cramér–von Mises test is usually more powerful than the Kolmogorov–Smirnov test for a broad class of alternative hypotheses. Let’s see how to conduct this in Python.
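One way to run it, assuming SciPy 1.6 or later (which added scipy.stats.cramervonmises), is to standardize the sample and compare it with the standard normal, just as we did for the KS test; note that estimating the mean and variance from the data carries the same caveat about p-value accuracy:

import numpy as np
from scipy import stats

z = (data - np.mean(data)) / np.std(data, ddof=1)   # standardize the sample
result = stats.cramervonmises(z, "norm")             # compare with N(0, 1)
print("W2 = %.4f, p-value = %.4f" % (result.statistic, result.pvalue))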

But both the KS and CvM statistics are insensitive when the differences between the curves are most prominent near the beginning or end of the distributions. This is because, by construction, the EDFs converge to 0.0 and 1.0 at the ends, so any deviations there must be small. To know more about this click here.

The Anderson-Darling (AD) test was developed in the 1950s as a weighted CvM test to overcome this problem.

Anderson-Darling (AD) test:

Compared with the Cramér–von Mises and K-S tests, the Anderson–Darling distance places more weight on observations in the tails of the distribution. If we calculate a weighted version of the Cramér–von Mises statistic, with weights inversely proportional to F(x)(1 − F(x)) (the variance of the empirical CDF at x, up to a constant), we end up with the Anderson–Darling statistic. The K-S test is distribution-free in the sense that its critical values do not depend on the specific distribution being tested, whereas the Anderson–Darling test uses the specific distribution in calculating critical values. The Anderson–Darling test is an alternative to the chi-square and Kolmogorov–Smirnov goodness-of-fit tests, and we can use the Anderson–Darling statistic to compare how well a data set fits different distributions. Where the Kolmogorov–Smirnov test takes the maximum difference between the EDF curves, the Anderson–Darling test considers all the differences; overall it is more powerful than the Kolmogorov–Smirnov test because of this more detailed comparison. Let’s see how to implement this in Python.
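A minimal sketch with scipy.stats.anderson, which returns the A² statistic together with critical values at several significance levels (the normal parameters are estimated from the sample, as discussed above):

from scipy import stats

result = stats.anderson(data, dist="norm")
print("A2 = %.4f" % result.statistic)
for crit, sig in zip(result.critical_values, result.significance_level):
    decision = "reject H0" if result.statistic > crit else "fail to reject H0"
    print("At the %.1f%% level, critical value = %.3f -> %s" % (sig, crit, decision))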

  • This test can be efficient even for small samples (n > 8).
  • The Anderson–Darling test statistic can be adjusted for the case where the parameters are estimated from the sample, which is the usual situation when testing normality.
  • If the sample distribution has heavier tails, this is often considered the best test.

Final Notes:

Summary of Test Results

  • One of the reasons for creating this notebook is that analysts are usually recommended to run more than one test when checking normality. Let’s see the results of the different normality tests for our sampled data.
  • There are a few other tests, based on sample skewness and kurtosis, entropy, etc., that I haven’t explained here. I will keep updating this notebook whenever I learn a new normality test and try to make notes on its limitations and advantages. Let’s summarize our results (see the sketch below).
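As a rough sketch of how such a summary can be produced, one can run the tests discussed above on the same data array and collect the p-values (lilliefors comes from statsmodels, cramervonmises assumes SciPy 1.6+, and Anderson–Darling is omitted here because scipy.stats.anderson reports critical values rather than a p-value):

import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

alpha = 0.05
z = (data - np.mean(data)) / np.std(data, ddof=1)    # standardized copy for KS / CvM

results = {
    "Shapiro-Wilk": stats.shapiro(data)[1],
    "Kolmogorov-Smirnov (vs N(0,1))": stats.kstest(z, "norm")[1],
    "Lilliefors": lilliefors(data, dist="norm")[1],
    "Cramer-von Mises (vs N(0,1))": stats.cramervonmises(z, "norm").pvalue,
}
for name, p_value in results.items():
    decision = "fail to reject H0" if p_value > alpha else "reject H0"
    print("%-32s p-value = %.4f -> %s" % (name, p_value, decision))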

A cell in red means the test rejects our null hypothesis.

A cell in green means the test fails to reject the null hypothesis.

Remember, we took n = 300 from a uniform distribution; try different values of n and different samples.

However, it is entirely possible to have p > alpha even when the data do not come from a normal population. Failure to reject could be because the sample size is too small to detect the non-normality. Keep this in mind when interpreting the results. Click here, and here for a deeper understanding.

I haven’t mentioned the effect of outliers in the sample data. Outliers will affect any normality test to some extent; I didn’t include any outliers in the sample.

Questions

Q1. Why can’t we make a decision about normality by just checking the sample mean, median, mode, variance, skewness, and kurtosis?

The simple answer is that they don’t completely characterize the distribution of the data. For more information click here.

Q2. Why do we have so many normality tests?

Because the tests are based on different characteristics of the normal distribution, and the power of these tests differs depending on the nature of the non-normality.

Normality tests differ in the characteristic of the normal distribution they focus on, such as its skewness and kurtosis values, its distribution or characteristic function, and the linear relationship that exists between a normally distributed variable and the standard normal z. The tests also differ in the level at which they compare the empirical distribution with the normal distribution (compare and summarize vs. summarize and compare), in the complexity of the test statistic, and in the nature of its distribution (a standard distribution or a specified one).

Q3. What criteria should we look at while choosing a normality test?

1. The parameters of the normal distribution against which you want to compare the sample data
2. Sample size
3. The characteristic of the normal distribution you want to test against
4. Power

All these are discussed above whenever needed.

Ex1:

  • If you are not interested in the parameters of the normal distribution and simply want to determine whether the distribution is normal, the S-W test, A-D test, or Lilliefors test (which test normality against a normal distribution with unknown mean and variance) are preferred.
  • If you want to compare against specific parameters (e.g., in general we expect the error distribution ~ N(0, σ)), one can opt for the KS test or AD test.

Ex2:

  • The SW test is preferred for small samples.
  • The AD test is preferred if we have a large enough sample.
  • The asymptotic power of the KS test is 1.
  • If we have a sufficiently large sample, a QQ plot is preferred.

Ex3:

  • If we observe that the sample distribution has heavier tails, the AD test is preferred.
  • If we observe that the sample distribution is skewed (i.e., the empirical distribution is summarized through its skewness and kurtosis statistics and compared to the skewness and kurtosis of the normal distribution), tests based on skewness and kurtosis (such as the Jarque–Bera test) are preferred; see the sketch after this list.
  • For distributions that have slightly or definitely higher kurtosis than the normal, the skewness-kurtosis based tests are more powerful than the other types of test.
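For completeness, here is a minimal sketch of skewness/kurtosis-based tests on the same data array; jarque_bera and normaltest (D’Agostino and Pearson’s combined skewness/kurtosis test) are both in scipy.stats:

from scipy import stats

jb_stat, jb_p = stats.jarque_bera(data)      # based on sample skewness and kurtosis
dp_stat, dp_p = stats.normaltest(data)       # D'Agostino-Pearson K^2 test
print("Jarque-Bera:        statistic = %.4f, p-value = %.4f" % (jb_stat, jb_p))
print("D'Agostino-Pearson: statistic = %.4f, p-value = %.4f" % (dp_stat, dp_p))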

Ex4: Note: The power depends on the way in which the null hypothesis is false (i.e., it depends on how you define your alternative hypothesis).

Suppose that the hypothesis H0 : X ∼ N(μ0, σ) is tested using the SW test.

Three kinds of alternative hypothesis can be considered:

a) X∼N(μ, σ) with μ ≠ μ0

b) X is not normal with μ = μ0

c) X is not normal with μ ≠ μ0.

The null hypothesis can also be tested against an entirely different distribution (say, X follows a Weibull distribution).

There are many studies available on the internet about the power of these tests under different conditions.

In most software packages, the alternative hypothesis is simply X ≁ N(μ, σ), just like the methods we discussed here.

Q4. What is the best test for normality?

We have seen that each of the common normality tests is most powerful under certain conditions.

Instead of asking for the best normality test, think about this question: “Why are you testing normality?” What matters more is what it means when these tests of normality reject the null, or fail to reject it.

Why is normality important in your application/analysis? With that information, you will be able to choose the appropriate test for your analysis.

Check this and this. Your perception of the best test for normality will definitely change.

The results of a test for normality should not only report a p-value; they should also be accompanied by a careful interpretation of the probability plot and the skewness and kurtosis statistics for a complete diagnosis.

References

Comparison of Tests for Univariate Normality

An Analysis of Variance Test for Normality

Beware the Kolmogorov-Smirnov test!

Distribution of the Anderson-Darling Statistic

Before you leave

Give your feedback and suggestions to make this content better. Your inputs are valuable.

Check out my other posts here.

To know about me, see my LinkedIn profile.
