analyze correlations — DataPrep 0.4.0 documentation

`plot_correlation()`: analyze correlations

Overview

The function plot_correlation() explores the correlations between columns in several ways, using multiple correlation metrics. The following list describes what plot_correlation() does for a given dataframe df.

plot_correlation(df): plots correlation matrices (correlations between all pairs of columns)
plot_correlation(df, col1): plots the columns most correlated with column col1
plot_correlation(df, col1, col2): plots the joint distribution of column col1 and column col2 and computes a regression line

The following table summarizes the output plots for different settings of col1 and col2.

`col1`	`col2`	Output
None	None	n*n correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients
Numerical	None	n*1 correlation matrix, computed with Pearson, Spearman, and KendallTau correlation coefficients
Categorical	None	TODO
Numerical	Numerical	scatter plot with a regression line
Numerical	Categorical	TODO
Categorical	Numerical	TODO
Categorical	Categorical	TODO
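
To make the three coefficients named in the table concrete, here is a minimal standard-library sketch of what each one measures. It is not DataPrep's implementation: the helper names (pearson, ranks, spearman, kendall_tau) and the toy data are assumptions made for illustration, and the Kendall variant shown is the simple tau-a, which does not correct for ties.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's r: strength of the linear relationship between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy data (made up): y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r = pearson(x, y)        # about 0.981: strong but not perfectly linear
rho = spearman(x, y)     # 1.0 up to rounding: perfectly monotonic
tau = kendall_tau(x, y)  # 1.0: every pair of rows is concordant
```

The toy data shows why it helps to look at more than one metric: a monotonic but nonlinear relationship scores below 1 on Pearson while the two rank-based coefficients report a perfect association.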

Next, we demonstrate the functionality of plot_correlation().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known wine quality dataset into a Pandas dataframe.

from dataprep.datasets import load_dataset
df = load_dataset("wine-quality-red")

Get an overview of the correlations with `plot_correlation(df)`

We start by calling plot_correlation(df) to compute summary statistics and correlation matrices using the Pearson, Spearman, and KendallTau correlation coefficients. The Stats tab lists four statistics for each of the three coefficients. The other three tabs show the lower triangular correlation matrices, in which each cell holds the correlation value between two columns. An “insight” tab (!) in the upper right-hand corner of each matrix shows insights about that matrix. The following shows an example:

from dataprep.eda import plot_correlation
plot_correlation(df)

DataPrep.EDA Report

Stats Pearson Spearman KendallTau

	Pearson	Spearman	KendallTau
Highest Positive Correlation	0.672	0.79	0.607
Highest Negative Correlation	-0.683	-0.707	-0.528
Lowest Correlation	0.002	0.001	0.0
Mean Correlation	0.019	0.028	0.021

Insights shown in the Pearson tab:

Most positive correlated: (fixed_acidity, citric_acid)
Most negative correlated: (fixed_acidity, pH)
Least correlated: (volatile_acidity, residual_sugar)

Insights shown in the Spearman tab:

Most positive correlated: (free_sulfur_d...ide, total_sulfur_...ide)
Most negative correlated: (fixed_acidity, pH)
Least correlated: (total_sulfur_...ide, sulphates)

Insights shown in the KendallTau tab:

Most positive correlated: (free_sulfur_d...ide, total_sulfur_...ide)
Most negative correlated: (fixed_acidity, pH)
Least correlated: (total_sulfur_...ide, sulphates)
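
The four rows of the Stats tab can be read as summaries of the lower triangle of a correlation matrix. The sketch below shows one plausible way to derive them; it is not taken from DataPrep's source. The function name correlation_stats and the 3x3 matrix are hypothetical, and it assumes that "Lowest Correlation" means the value closest to zero, which is consistent with the values reported above (0.002, 0.001, 0.0).

```python
def correlation_stats(matrix):
    """Summarize the strictly lower-triangular entries of a square matrix."""
    # Skip the diagonal (always 1.0) and the mirrored upper triangle.
    values = [matrix[i][j] for i in range(len(matrix)) for j in range(i)]
    return {
        "Highest Positive Correlation": max(values),
        "Highest Negative Correlation": min(values),
        # Assumption: "lowest" means weakest, i.e. closest to zero.
        "Lowest Correlation": min(values, key=abs),
        "Mean Correlation": sum(values) / len(values),
    }


# Hypothetical correlation matrix for three columns (values are made up).
corr = [
    [1.0, 0.6, -0.4],
    [0.6, 1.0, 0.1],
    [-0.4, 0.1, 1.0],
]

stats = correlation_stats(corr)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Only the lower triangle is used because a correlation matrix is symmetric, so each pair of columns is counted once.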

Find the columns that are most correlated to column `col1` with `plot_correlation(df, col1)`

After computing the correlation matrices, we can examine how the other columns correlate with a specific column x using plot_correlation(df, x). This call computes the correlation between column x and every other column (using the Pearson, Spearman, and KendallTau correlation coefficients) and sorts the results in decreasing order, which makes it easy to see which columns are most positively and most negatively correlated with column x. The following shows an example:

plot_correlation(df, "alcohol")

DataPrep.EDA Report

Pearson Spearman KendallTau
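
The ranking described above can be sketched with the standard library alone. This is an illustration of the idea, not DataPrep's code: rank_by_correlation and the toy columns are hypothetical, and only the Pearson coefficient is shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(columns, target):
    """Return (name, r) for every other column, sorted by decreasing r."""
    scores = [
        (name, pearson(values, columns[target]))
        for name, values in columns.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up toy columns (not the wine quality data).
columns = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "up": [2.0, 4.0, 6.0, 8.0, 10.0],    # rises in step with x
    "down": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls in step with x
    "mixed": [2.0, 1.0, 3.0, 1.0, 2.0],  # no linear trend against x
}

ranking = rank_by_correlation(columns, "x")
for name, r in ranking:
    print(f"{name}: {r:+.3f}")
```

Because the list is sorted by the signed coefficient, the most positively correlated columns appear first and the most negatively correlated ones last, with weakly correlated columns in the middle.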


Explore the correlation between two columns with `plot_correlation(df, col1, col2)`

Furthermore, plot_correlation(df, col1, col2) provides a detailed analysis of the correlation between two columns col1 and col2. It plots the joint distribution of col1 and col2 as a scatter plot, together with a regression line. The following shows an example:

plot_correlation(df, "alcohol", "pH")

DataPrep.EDA Report

Scatter Plot & Regression Line

Plot parameters shown in the report: 'scatter.sample_size': 1000 (number of points to randomly sample per partition), 'height': 400 (height of the plot), 'width': 400 (width of the plot).
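
For readers who want to see what a regression line over a scatter plot amounts to, here is a minimal ordinary least squares sketch. Whether DataPrep fits its line exactly this way is an assumption; the function name fit_line and the sample points are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that roughly follow y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(x, y)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The sign of the slope matches the sign of the Pearson coefficient for the same two columns, so the line gives a quick visual check on the value reported in the correlation matrix.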
