Mahalanobis Distance in Python [Easy Implementation]


Mahalanobis distance is an effective multivariate distance metric that helps to measure the distance between a data point and a data distribution.

It is an extremely useful metric in multivariate anomaly detection and in classification on highly imbalanced datasets.

This tutorial explains what Mahalanobis distance is and how to compute it in Python.


Formula for Mahalanobis Distance

The formula to compute Mahalanobis distance is as follows:

D^2 = (x - m)^T · C^(-1) · (x - m)

where,

  • D^2 is the square of the Mahalanobis distance,
  • x is the vector of the observation (a row in the dataset),
  • m is the vector of mean values of the independent variables (the mean of each column),
  • C^(-1) is the inverse of the covariance matrix of the independent variables.
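
As a small worked example of the formula, the sketch below computes D^2 for one observation against a tiny made-up sample and cross-checks the result with `scipy.spatial.distance.mahalanobis`, which returns the unsquared distance D. The sample values and variable names are illustrative assumptions and are not part of this tutorial's dataset.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical sample: 5 observations (rows) of 2 variables (columns)
samples = np.array([[2.0, 2.0],
                    [2.0, 5.0],
                    [6.0, 5.0],
                    [7.0, 3.0],
                    [4.0, 7.0]])

x = samples[0]                       # observation to measure
m = samples.mean(axis=0)             # vector of column means
C = np.cov(samples, rowvar=False)    # sample covariance matrix
C_inv = np.linalg.inv(C)             # inverse covariance matrix

# D^2 = (x - m)^T · C^(-1) · (x - m)
diff = x - m
d_squared = diff @ C_inv @ diff

# SciPy returns D (not D^2), so square it before comparing
d_scipy = distance.mahalanobis(x, m, C_inv)
print(d_squared, d_scipy ** 2)
```

Both printed values should agree up to floating-point error, since the SciPy function takes the inverse covariance matrix directly as its third argument.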

Code Implementation for Mahalanobis Distance in Python

We need to install and import the following libraries to compute the distance in Python: NumPy, pandas, and SciPy.

import numpy as np
import pandas as pd
from scipy import stats

We will be considering a dataset of 10 food items with the following four columns:

  1. Price of the food item
  2. Amount of protein in the food
  3. Amount of fat in the food
  4. Amount of carbohydrate in the food

data = {
    'Price': [100000, 800000, 650000, 700000, 860000,
              730000, 400000, 870000, 780000, 400000],
    'Protein': [16000, 60000, 300000, 10000, 252000,
                350000, 260000, 510000, 2000, 5000],
    'Fat': [300, 400, 1230, 300, 400, 104, 632, 221, 142, 267],
    'Carbohydrate': [60, 88, 90, 87, 83, 81, 72, 91, 90, 93],
}
food_data = pd.DataFrame(data, columns=['Price', 'Protein',
                                        'Fat', 'Carbohydrate'])
food_data.head()
Food Items Dataset Mahalanobis Distance

Next, we will write a short function that calculates the squared distance D^2 for every row and add the result as a column to the original DataFrame.

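def mahalanobis(x=None, data=None, cov=None):
    # Center each observation on the column means of the reference data
    x_mu = (x - data.mean()).to_numpy()
    # Estimate the covariance matrix from the data unless one is supplied
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(x_mu, inv_covmat)
    mahal = np.dot(left, x_mu.T)
    # The diagonal holds the squared distance D^2 of each row
    return mahal.diagonal()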

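# Select the feature columns explicitly so the cell can be re-run safely
features = ['Price', 'Protein', 'Fat', 'Carbohydrate']
food_data['Mahalanobis_Dis'] = mahalanobis(x=food_data[features],
                                           data=food_data[features])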
food_data.head()
Food Items Dataset Mahalanobis Distance Computed
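
The function above builds a full n × n matrix only to read its diagonal. For larger datasets, a sketch like the following computes the same squared distances row by row with `np.einsum` and cross-checks them against `scipy.spatial.distance.cdist`. The helper name `mahalanobis_sq` and the random data are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mahalanobis_sq(points, reference):
    """Squared Mahalanobis distance of each row of `points` from the
    mean of `reference`, using the covariance of `reference`."""
    points = np.asarray(points, dtype=float)
    reference = np.asarray(reference, dtype=float)
    diff = points - reference.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(reference, rowvar=False))
    # Row-wise quadratic form diff_i^T · inv_cov · diff_i, no n x n matrix
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Hypothetical data: 50 observations of 3 variables
rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 3))

d_sq = mahalanobis_sq(sample, sample)

# cdist returns the unsquared distance from each row to the mean vector
mean = sample.mean(axis=0, keepdims=True)
inv_cov = np.linalg.inv(np.cov(sample, rowvar=False))
d_ref = cdist(sample, mean, metric='mahalanobis', VI=inv_cov).ravel()
```

Squaring `d_ref` should reproduce `d_sq` up to floating-point error. The einsum version needs memory proportional to the number of rows rather than to its square.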

We can see that some of the distances are much larger than others. To determine if any of the distances are statistically significant, we need to calculate their p-values.

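The p-value for each distance is the upper-tail probability of the Chi-Square distribution evaluated at the squared distance. For approximately multivariate normal data, the squared Mahalanobis distance follows a Chi-Square distribution with k degrees of freedom, where k is the number of variables (here, k = 4).

from scipy.stats import chi2
# Upper-tail probability: sf(x, df) equals 1 - cdf(x, df)
food_data['p_value'] = chi2.sf(food_data['Mahalanobis_Dis'], 4)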
food_data.head()
Food Items Dataset P-Value Computed

Typically, a data point whose p-value is less than 0.001 is considered an outlier. Depending on your problem, you may decide to remove such an observation from the dataset if it would otherwise distort the results of your analysis.
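
The cutoff on the p-value can equivalently be expressed as a cutoff on the squared distance itself. The sketch below assumes a hypothetical helper `flag_outliers` that takes the squared distances, the degrees of freedom `dof` of the Chi-Square distribution, and a significance level `alpha`. The distances and the value of `dof` are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2

def flag_outliers(d_squared, dof, alpha=0.001):
    """Return p-values and a boolean outlier mask for squared distances."""
    d_squared = np.asarray(d_squared, dtype=float)
    # Survival function = 1 - CDF, but more accurate for tiny p-values
    p_values = chi2.sf(d_squared, dof)
    return p_values, p_values < alpha

# Made-up squared distances; the last one is deliberately extreme
d_squared = np.array([1.2, 3.5, 0.8, 6.1, 30.0])
dof = 4  # illustrative value; use the degrees of freedom of your own test
p_values, is_outlier = flag_outliers(d_squared, dof)

# Equivalent test: compare D^2 with the Chi-Square critical value
critical = chi2.isf(0.001, dof)
same_mask = d_squared > critical
```

Working with the critical value can be convenient when new observations arrive one at a time, because the threshold is computed once and each new squared distance is simply compared against it.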


Conclusion

Congratulations! In this tutorial, we covered Mahalanobis distance: the formula and its calculation in Python. You also learned how to identify outliers in a dataset, which can help make your analysis more accurate.

Thank you for reading the tutorial!
