A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known.
Samples of data whose distribution we already know, or can easily identify, are called parametric data. In common usage, parametric often refers to data drawn from a Gaussian distribution. Data whose distribution is unknown or cannot be easily identified is called nonparametric.
When you are working with nonparametric data, you can use specialized nonparametric statistical methods that discard all information about the distribution. As such, these methods are often referred to as distribution-free methods.
In this tutorial, you will discover nonparametric statistics and their role in applied machine learning.
After completing this tutorial, you will know:
 The difference between parametric and nonparametric data.
 How to rank data in order to discard all information about the data’s distribution.
 Examples of statistical methods that can be used for ranked data.
Let’s get started.
Tutorial Overview
This tutorial is divided into 4 parts; they are:
 Parametric Data
 Nonparametric Data
 Ranking Data
 Working with Ranked Data
Parametric Data
Parametric data is a sample of data drawn from a known data distribution.
This means that we already know the distribution or we have identified the distribution, and that we know the parameters of the distribution. Often, parametric is shorthand for real-valued data drawn from a Gaussian distribution. This is a useful shorthand, but strictly speaking it is not entirely accurate.
If we have parametric data, we can use parametric methods. Continuing with the shorthand of parametric meaning Gaussian, if we have parametric data, we can harness the entire suite of statistical methods developed for data assuming a Gaussian distribution, such as:
 Summary statistics.
 Correlation between variables.
 Significance tests for comparing means.
In general, we prefer to work with parametric data, and even go so far as to use data preparation methods that make data parametric, such as data transforms, so that we can harness these well-understood statistical methods.
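As a rough illustration of the three kinds of parametric methods listed above, the sketch below draws two Gaussian samples and applies each in turn. The sample size, means, and standard deviations are arbitrary values chosen for the example, not values from this tutorial.

```python
from numpy.random import seed, randn
from scipy.stats import pearsonr, ttest_ind

# seed random number generator
seed(1)
# two hypothetical Gaussian samples (parameters are arbitrary)
sample1 = 5 * randn(100) + 50
sample2 = sample1 + (5 * randn(100) + 1)

# summary statistics: sample mean and sample standard deviation
print('mean=%.3f stdv=%.3f' % (sample1.mean(), sample1.std(ddof=1)))

# correlation between variables: Pearson's correlation coefficient
corr, p_corr = pearsonr(sample1, sample2)
print('Pearson r=%.3f, p=%.3f' % (corr, p_corr))

# significance test for comparing means: Student's t-test
stat, p_ttest = ttest_ind(sample1, sample2)
print('t=%.3f, p=%.3f' % (stat, p_ttest))
```

Each of these calls assumes the data is Gaussian, which is why they are only appropriate for parametric data in the shorthand sense used here.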
Nonparametric Data
Data that does not fit a known or well-understood distribution is referred to as nonparametric data.
Data could be nonparametric for many reasons, such as:
 Data is not real-valued, but instead is ordinal, interval, or some other form.
 Data is real-valued but does not fit a well-understood shape.
 Data is almost parametric but contains outliers, multiple peaks, a shift, or some other feature.
There is a suite of methods that we can use for nonparametric data called nonparametric statistical methods. In fact, most parametric methods have an equivalent nonparametric version.
In general, nonparametric methods are less powerful than their parametric counterparts, mainly because they must be general enough to work for all types of data. We can still use them for inference and make claims about findings and results, but those claims will not hold the same weight as similar claims made with parametric methods, because information about the distribution is discarded.
In the case of ordinal or interval data, nonparametric statistics are the only type of statistics that can be used. For real-valued data, nonparametric statistical methods are required in applied machine learning when you are trying to make claims on data that does not fit the familiar Gaussian distribution.
Ranking Data
Before a nonparametric statistical method can be applied, the data must be converted into a rank format.
As such, statistical methods that expect data in rank format are sometimes called rank statistics, such as rank correlation and rank statistical hypothesis tests.
Ranking data is exactly as its name suggests. The procedure is as follows:
 Sort all data in the sample in ascending order.
 Assign an integer rank from 1 to N for each unique value in the data sample.
For example, imagine we have the following data sample, presented as a column:

0.020
0.184
0.431
0.550
0.620
We can sort it as follows:

0.020
0.184
0.431
0.550
0.620
Then assign a rank to each value, starting at 1:

1 = 0.020
2 = 0.184
3 = 0.431
4 = 0.550
5 = 0.620
We can then apply this procedure to another data sample and start using nonparametric statistical methods.
There are variations on this procedure for special circumstances such as handling ties, using a reverse ranking, and using a fractional rank score, but the general properties hold.
The SciPy library provides the rankdata() function to rank numerical data, which supports a number of variations on ranking.
The example below demonstrates how to rank a numerical dataset.

from numpy.random import rand
from numpy.random import seed
from scipy.stats import rankdata
# seed random number generator
seed(1)
# generate dataset
data = rand(1000)
# review first 10 samples
print(data[:10])
# rank data
ranked = rankdata(data)
# review first 10 ranked samples
print(ranked[:10])
Running the example first generates a sample of 1,000 random numbers from a uniform distribution, then ranks the data sample and prints the result.

[4.17022005e-01 7.20324493e-01 1.14374817e-04 3.02332573e-01
 1.46755891e-01 9.23385948e-02 1.86260211e-01 3.45560727e-01
 3.96767474e-01 5.38816734e-01]
[408. 721.   1. 300. 151.  93. 186. 342. 385. 535.]
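The ranking variations mentioned earlier, such as handling ties, reverse ranking, and fractional ranks, can be explored with the method argument of rankdata(). The sketch below uses a small made-up sample containing one tie. Negating the data to get a reverse ranking is just one simple approach, not the only one.

```python
from scipy.stats import rankdata

# small hypothetical sample containing a tie (the two 20s)
data = [10, 20, 20, 30]

# fractional ranks: tied values share the mean of the ranks they span (default)
average = rankdata(data, method='average')   # 1, 2.5, 2.5, 4
# tied values all receive the lowest rank they span
minimum = rankdata(data, method='min')       # 1, 2, 2, 4
# like 'min', but without gaps in the ranks after a tie
dense = rankdata(data, method='dense')       # 1, 2, 2, 3
# ties are broken by order of appearance
ordinal = rankdata(data, method='ordinal')   # 1, 2, 3, 4
# reverse ranking: negate the data so the largest value gets rank 1
reverse = rankdata([-x for x in data])       # 4, 2.5, 2.5, 1

print(average, minimum, dense, ordinal, reverse)
```

The default 'average' method is the one most rank-based tests rely on when a sample contains ties.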
Working with Ranked Data
There are statistical tools that you can use to check if your sample data fits a given distribution.
For example, if we take nonparametric data as data that does not look Gaussian, then you can use statistical methods that quantify how Gaussian a sample of data is and use nonparametric methods if the data fails those tests.
Three examples of statistical methods for normality testing, as it is called, are:
 Shapiro-Wilk test.
 Kolmogorov-Smirnov test.
 Anderson-Darling test.
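As a hedged sketch, the three tests listed above are available in scipy.stats as shapiro(), kstest(), and anderson(). The Gaussian sample and the 0.05 threshold below are assumptions made for illustration. Note that standardizing the data with its own mean and standard deviation before the Kolmogorov-Smirnov test makes the reported p-value approximate only.

```python
from numpy.random import seed, randn
from scipy.stats import anderson, kstest, shapiro

# seed random number generator
seed(1)
# hypothetical sample: 100 Gaussian observations (parameters are arbitrary)
data = 5 * randn(100) + 50
# assumed significance level
alpha = 0.05

# Shapiro-Wilk test
stat, p_shapiro = shapiro(data)
print('Shapiro-Wilk: stat=%.3f, p=%.3f' % (stat, p_shapiro))

# Kolmogorov-Smirnov test against a standard normal, after standardizing
standardized = (data - data.mean()) / data.std(ddof=1)
stat, p_ks = kstest(standardized, 'norm')
print('Kolmogorov-Smirnov: stat=%.3f, p=%.3f' % (stat, p_ks))

# Anderson-Darling test: compare the statistic to the critical values
result = anderson(data, dist='norm')
print('Anderson-Darling: stat=%.3f' % result.statistic)
print('critical values:', result.critical_values)

# a small p-value suggests the sample does not look Gaussian
for name, p in (('Shapiro-Wilk', p_shapiro), ('Kolmogorov-Smirnov', p_ks)):
    verdict = 'looks Gaussian' if p > alpha else 'does not look Gaussian'
    print('%s: sample %s' % (name, verdict))
```

If the data fails these checks, that is a signal to reach for nonparametric methods.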
Once you have decided to use nonparametric statistics, you must then rank your data.
In fact, most of the tools that you use for inference will perform the ranking of the sample data automatically. Nevertheless, it is important to understand how your sample data is being transformed prior to performing the tests.
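To see the automatic ranking in action, the sketch below compares Spearman's rank correlation computed on raw data with Pearson's correlation computed on data that was ranked by hand with rankdata(). The dataset is made up for illustration. The two coefficients should agree, which suggests that spearmanr() ranks the data internally.

```python
from numpy.random import seed, rand
from scipy.stats import pearsonr, rankdata, spearmanr

# seed random number generator
seed(1)
# hypothetical data: a monotonic but nonlinear relationship plus noise
x = rand(100)
y = x ** 2 + 0.1 * rand(100)

# Spearman's correlation on the raw data (ranking happens inside the call)
rho_auto, _ = spearmanr(x, y)
# Pearson's correlation on data we ranked ourselves
rho_manual, _ = pearsonr(rankdata(x), rankdata(y))

print('spearmanr on raw data: %.6f' % rho_auto)
print('pearsonr on ranked data: %.6f' % rho_manual)
```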
In applied machine learning, there are two main types of questions that you may have about your data that you can address with nonparametric statistical methods.
Relationship Between Variables
Methods for quantifying the dependency between variables are called correlation methods.
Two nonparametric statistical correlation methods that you can use are:
 Spearman’s rank correlation coefficient.
 Kendall rank correlation coefficient.
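Both coefficients are available in SciPy as spearmanr() and kendalltau(). The example below is a minimal sketch on a made-up pair of uniformly distributed variables, with the second variable built from the first so that a positive association is expected.

```python
from numpy.random import seed, rand
from scipy.stats import kendalltau, spearmanr

# seed random number generator
seed(1)
# hypothetical variables: data2 depends on data1 plus uniform noise
data1 = rand(1000) * 20
data2 = data1 + (rand(1000) * 10)

# Spearman's rank correlation coefficient
coef, p_spearman = spearmanr(data1, data2)
print('Spearman: coef=%.3f, p=%.3f' % (coef, p_spearman))

# Kendall's rank correlation coefficient
tau, p_kendall = kendalltau(data1, data2)
print('Kendall: tau=%.3f, p=%.3f' % (tau, p_kendall))
```

Both coefficients range from -1 to 1, and a small p-value suggests the association is unlikely to be due to chance.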
Compare Sample Means
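Methods for quantifying whether the difference between two or more samples is statistically significant are called statistical significance tests. The nonparametric versions compare the ranks of the observations rather than the sample means directly.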
Three nonparametric statistical significance tests that you can use are:
 Friedman test.
 Mann-Whitney U test.
 Wilcoxon signed-rank test.
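All three tests have SciPy implementations: mannwhitneyu() for two independent samples, wilcoxon() for two paired samples, and friedmanchisquare() for three or more paired samples. The sketch below applies them to made-up uniform samples with slightly shifted locations. Treating these independently generated samples as paired is an assumption made purely to show the calls.

```python
from numpy.random import seed, rand
from scipy.stats import friedmanchisquare, mannwhitneyu, wilcoxon

# seed random number generator
seed(1)
# hypothetical non-Gaussian samples with shifted locations
data1 = 50 + (rand(100) * 10)
data2 = 51 + (rand(100) * 10)
data3 = 52 + (rand(100) * 10)

# Mann-Whitney U test: two independent samples
stat, p_mw = mannwhitneyu(data1, data2, alternative='two-sided')
print('Mann-Whitney U: stat=%.3f, p=%.3f' % (stat, p_mw))

# Wilcoxon signed-rank test: two paired samples
stat, p_w = wilcoxon(data1, data2)
print('Wilcoxon signed-rank: stat=%.3f, p=%.3f' % (stat, p_w))

# Friedman test: three or more paired samples
stat, p_f = friedmanchisquare(data1, data2, data3)
print('Friedman: stat=%.3f, p=%.3f' % (stat, p_f))
```

In each case, a p-value below your chosen significance level suggests that the samples were drawn from different distributions.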
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
 List three examples of when you think you might need to use nonparametric statistical methods in an applied machine learning project.
 Develop your own example to demonstrate the capabilities of the rankdata() function.
 Write your own function to rank a provided univariate dataset.
If you explore any of these extensions, I’d love to know.
Summary
In this tutorial, you discovered nonparametric statistics and their role in applied machine learning.
Specifically, you learned:
 The difference between parametric and nonparametric data.
 How to rank data in order to discard all information about the data’s distribution.
 Examples of statistical methods that can be used for ranked data.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.