Data summarization provides a convenient way to describe all of the values in a data sample with just a few statistical values.
The mean and standard deviation are used to summarize data with a Gaussian distribution, but may not be meaningful, or could even be misleading, if your data sample has a non-Gaussian distribution.
In this tutorial, you will discover the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.
After completing this tutorial, you will know:
 Data summarization, such as calculating the mean and standard deviation, is only meaningful for the Gaussian distribution.
 The five-number summary can be used to describe a data sample with any distribution.
 How to calculate the five-number summary in Python.
Let’s get started.
Tutorial Overview
This tutorial is divided into 4 parts; they are:
 Nonparametric Data Summarization
 Five-Number Summary
 How to Calculate the Five-Number Summary
 Use of the Five-Number Summary
Nonparametric Data Summarization
Data summarization techniques provide a way to describe the distribution of data using a few key measurements.
The most common example of data summarization is the calculation of the mean and standard deviation for data that has a Gaussian distribution. With these two parameters alone, you can understand and recreate the distribution of the data. The data summary can compress as few as tens or as many as millions of individual observations.
The problem is that the mean and standard deviation do not meaningfully describe data that does not have a Gaussian distribution. Technically, you can calculate these quantities, but they do not summarize the data distribution; in fact, they can be very misleading.
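To make the risk concrete, here is a minimal sketch using a small, made-up sample with one extreme value. The numbers are purely illustrative and are not from any real dataset, and the sketch uses Python's standard statistics module rather than NumPy to keep it dependency-free.

```python
# Illustrative only: a small, made-up sample with one extreme value.
from statistics import mean, median, stdev

sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 100]

sample_mean = mean(sample)
sample_median = median(sample)
sample_stdev = stdev(sample)

# count how many observations fall below the mean
below_mean = sum(1 for value in sample if value < sample_mean)

print('Mean: %.1f' % sample_mean)      # Mean: 12.7
print('Median: %.1f' % sample_median)  # Median: 3.0
print('Stdev: %.1f' % sample_stdev)
print('%d of %d observations are below the mean' % (below_mean, len(sample)))
```

In this sample, nine of the ten observations sit below the mean, so the mean says little about a typical value, whereas the median does.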
In the case of data that does not have a Gaussian distribution, you can summarize the data sample using the five-number summary.
Five-Number Summary
The five-number summary, or 5-number summary for short, is a non-parametric data summarization technique.
It is sometimes called the Tukey 5-number summary because it was recommended by John Tukey. It can be used to describe the distribution of data samples for data with any distribution.
As a standard summary for general use, the 5-number summary provides about the right amount of detail.
— Page 37, Understanding Robust and Exploratory Data Analysis, 2000.
The five-number summary involves the calculation of 5 summary statistical quantities, namely:
 Median: The middle value in the sample, also called the 50th percentile or the 2nd quartile.
 1st Quartile: The 25th percentile.
 3rd Quartile: The 75th percentile.
 Minimum: The smallest observation in the sample.
 Maximum: The largest observation in the sample.
A quartile is an observed value at a point that aids in splitting the ordered data sample into four equally sized parts. The median, or 2nd Quartile, splits the ordered data sample into two parts, and the 1st and 3rd quartiles split each of those halves into quarters.
A percentile is an observed value at a point that aids in splitting the ordered data sample into 100 equally sized portions. Quartiles are often also expressed as percentiles.
Both the quartile and percentile values are examples of rank statistics that can be calculated on a data sample with any distribution. They are used to quickly summarize how much of the data in the distribution is behind or in front of a given observed value. For example, half of the observations are behind the median of a distribution and half are in front of it.
Note that quartiles are also calculated in the box and whisker plot, a non-parametric method to graphically summarize the distribution of a data sample.
How to Calculate the Five-Number Summary
Calculating the five-number summary involves finding the observations for each quartile as well as the minimum and maximum observed values from the data sample.
If there is no specific value in the ordered data sample for the quartile, such as if there is an even number of observations and we are trying to find the median, then we can calculate the mean of the two closest values, such as the two middle values.
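The rule for the median can be sketched in plain Python. The function name manual_median and the sample values below are hypothetical, chosen only for illustration.

```python
# Manual median: sort the sample, then take the middle value (odd count)
# or the mean of the two middle values (even count).
def manual_median(values):
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError('median of an empty sample is undefined')
    mid = n // 2
    if n % 2 == 1:
        # odd number of observations: a single middle value exists
        return ordered[mid]
    # even number of observations: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [7, 1, 5]
even_sample = [7, 1, 5, 3]
print(manual_median(odd_sample))   # 5
print(manual_median(even_sample))  # 4.0
```

For the even-sized sample, the ordered values are 1, 3, 5, 7, so the median is the mean of 3 and 5.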
We can calculate arbitrary percentile values in Python using the percentile() NumPy function. We can use this function to calculate the 1st, 2nd (median), and 3rd quartile values. The function takes both an array of observations and a floating point value to specify the percentile to calculate in the range of 0 to 100. It can also take a list of percentile values to calculate multiple percentiles; for example:

quartiles = percentile(data, [25, 50, 75]) 
By default, the function will calculate a linear interpolation (average) between observations if needed, such as in the case of calculating the median on a sample with an even number of values.
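As a small check of this default behavior, the sketch below runs percentile() on a made-up four-value sample. With an even number of values, the median falls halfway between the two middle observations, while the 1st and 3rd quartiles are interpolated at other positions between neighboring observations rather than being a simple average.

```python
# Default (linear) interpolation in numpy.percentile on a tiny sample.
from numpy import array, percentile

data = array([1, 2, 3, 4])
q1, median, q3 = percentile(data, [25, 50, 75])

# the median is the mean of the two middle values, 2 and 3
# the quartiles fall three quarters of the way from 1 to 2, and
# one quarter of the way from 3 to 4
print(q1, median, q3)  # 1.75 2.5 3.25
```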
The NumPy functions min() and max() can be used to return the smallest and largest values in the data sample; for example:

data_min, data_max = data.min(), data.max() 
We can put all of this together.
The example below generates a data sample drawn from a uniform distribution between 0 and 1 and summarizes it using the five-number summary.

# calculate a 5-number summary
from numpy import percentile
from numpy.random import rand
# generate data sample
data = rand(1000)
# calculate quartiles
quartiles = percentile(data, [25, 50, 75])
# calculate min/max
data_min, data_max = data.min(), data.max()
# print 5-number summary
print('Min: %.3f' % data_min)
print('Q1: %.3f' % quartiles[0])
print('Median: %.3f' % quartiles[1])
print('Q3: %.3f' % quartiles[2])
print('Max: %.3f' % data_max)
Running the example generates the data sample and calculates the five-number summary to describe the sample distribution.
We can see that the spread of observations is close to our expectations, showing 0.277 for the 25th percentile, 0.532 for the 50th percentile, and 0.766 for the 75th percentile, close to the idealized values of 0.25, 0.50, and 0.75 respectively. Because the sample is random, your specific results will vary.

Min: 0.000
Q1: 0.277
Median: 0.532
Q3: 0.766
Max: 1.000
Use of the Five-Number Summary
The five-number summary can be calculated for a data sample with any distribution.
This includes data that has a known distribution, such as a Gaussian or Gaussian-like distribution.
I would recommend always calculating the five-number summary, and only moving on to distribution-specific summaries, such as mean and standard deviation for the Gaussian, in the case that you can identify the distribution to which the data belongs.
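One way to follow this advice is sketched below: a small helper that returns the five-number summary, applied to a Gaussian sample before the mean and standard deviation are reported. The helper name five_number_summary, the seed, and the sample parameters (mean 50, standard deviation 5) are all assumptions made for illustration.

```python
# Sketch: always compute the 5-number summary first, then add
# distribution-specific statistics once the distribution is known.
from numpy import asarray, percentile
from numpy.random import seed, randn

def five_number_summary(sample):
    """Return (min, Q1, median, Q3, max) for a one-dimensional sample."""
    values = asarray(sample, dtype=float)
    q1, median, q3 = percentile(values, [25, 50, 75])
    return values.min(), q1, median, q3, values.max()

# seed the generator so repeated runs use the same sample
seed(1)
# Gaussian sample with an assumed mean of 50 and standard deviation of 5
data = 50 + 5 * randn(1000)

summary = five_number_summary(data)
labels = ['Min', 'Q1', 'Median', 'Q3', 'Max']
for label, value in zip(labels, summary):
    print('%s: %.3f' % (label, value))

# Gaussian-specific summary, appropriate here because the distribution is known
print('Mean: %.3f, Std: %.3f' % (data.mean(), data.std()))
```

For a Gaussian sample, the median should land close to the mean, which offers a quick consistency check between the two kinds of summary.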
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
 Describe three examples in a machine learning project where a five-number summary could be calculated.
 Generate a data sample with a Gaussian distribution and calculate the five-number summary.
 Write a function to calculate a 5-number summary for any data sample.
If you explore any of these extensions, I’d love to know.
Summary
In this tutorial, you discovered the five-number summary for describing the distribution of a data sample without assuming a specific data distribution.
Specifically, you learned:
 Data summarization, such as calculating the mean and standard deviation, is only meaningful for the Gaussian distribution.
 The five-number summary can be used to describe a data sample with any distribution.
 How to calculate the five-number summary in Python.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.