Pandas Profiling: Exploratory Data Analysis

Image source: https://github.com/pandas-profiling/pandas-profiling

In this article, I would like to share a tool that generates profile reports from a pandas DataFrame in Python with only two lines of code. Yes, you read that right: most of your exploratory data analysis takes only two lines of Python.

This tool is called pandas-profiling, created by Simon Brugman (https://github.com/pandas-profiling/pandas-profiling). Although it was first published as open source on GitHub in 2016, I have been hearing more and more about it this year, and in my experience it saves a lot of time by generating the visualizations and exploratory statistics that you typically need.

Installation

You can install using the pip package manager by running:

pip install pandas-profiling

or using the conda package manager by running:

conda install -c conda-forge pandas-profiling

Getting started

Jupyter Notebook is the recommended way to get the most out of this tool and enjoy the interactive reports of pandas-profiling. Let’s create a simple DataFrame and generate its profile report with two lines of code: one to import ProfileReport and one to call it on the DataFrame.
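A minimal sketch of those two lines follows. The toy DataFrame, its column names, and the helper profile_dataframe are made up for illustration. ProfileReport and its title argument come from the pandas-profiling package installed above. Newer releases of the project are published as ydata-profiling and imported as ydata_profiling, so adjust the import if that is what you have installed.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 32],
        "income": [40000.0, 52000.0, None, 88000.0, 52000.0],
        "city": ["Austin", "Houston", "Austin", "Dallas", "Houston"],
    }
)


def profile_dataframe(frame, title="Pandas Profiling Report"):
    """Return a profile report for `frame` (hypothetical helper)."""
    # Line 1 of 2: the import. It lives inside the function so the
    # DataFrame above can be built even before pandas-profiling is installed.
    from pandas_profiling import ProfileReport

    # Line 2 of 2: build the report object from the DataFrame.
    return ProfileReport(frame, title=title)


# Typical use in a Jupyter notebook (commented out because it needs
# pandas-profiling installed and takes a moment to compute):
# profile = profile_dataframe(df)
# profile.to_notebook_iframe()    # render the report inline
# profile.to_file("report.html")  # or save a standalone HTML file
```

Outside a notebook, saving the report with to_file and opening the HTML in a browser is usually the most convenient option.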

As you can see in the picture below, the profiler provides 6 main categories (Overview, Variables, Interactions, Correlations, Missing values, Sample), which are explained below:

Overview: summarizes the dataset, including missing and duplicated values in the DataFrame, which is an important check for machine learning applications.

Variables: the distribution and further statistics (minimum, maximum, median, common values, extreme values, etc.) for each feature.

Interactions: the relationship between two variables in the data frame as scatter plots.

Correlations: Pearson’s, Spearman’s, Kendall’s, and Phik correlation coefficients between the features, along with a description of each coefficient.

Missing values: details regarding missing data of each feature in the data frame.

Sample: the first and last rows of the DataFrame, the same rows you would get from the df.head() and df.tail() methods.
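To get a feel for what these sections contain, a few of the numbers can be reproduced by hand with plain pandas. This is only a rough sketch on made-up toy data, not how pandas-profiling computes its report internally. Kendall’s and Phik coefficients are left out because they need additional packages.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 32],
        "income": [40000.0, 52000.0, None, 88000.0, 52000.0],
        "city": ["Austin", "Houston", "Austin", "Dallas", "Houston"],
    }
)

# Overview: missing cells and duplicate rows across the whole frame.
n_missing_cells = int(df.isna().sum().sum())
n_duplicate_rows = int(df.duplicated().sum())

# Variables: summary statistics (min, max, median, ...) for one feature.
age_stats = df["age"].describe()

# Correlations: pairwise coefficients between the numeric features.
numeric = df[["age", "income"]]
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")

# Missing values: count of missing entries per column.
missing_per_column = df.isna().sum()

# Sample: first and last rows.
first_rows = df.head(2)
last_rows = df.tail(2)

print("missing cells:", n_missing_cells)
print("duplicate rows:", n_duplicate_rows)
print(age_stats)
print(pearson)
print(missing_per_column)
```

Doing this for every column and every pair of columns is the repetitive work that the profile report takes care of.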

As explained, pandas-profiling brings most of the common exploratory data analysis steps (data statistics, distributions, correlation coefficients, interactions, etc.) into one simple and easy-to-use tool.

Pandas-profiling for big data

The main disadvantage of this method shows up when the datasets are large. Some of the computations in the tool are expensive (such as correlations or duplicate row detection). To save time in the exploratory analysis, you can pass minimal=True to ProfileReport, which disables these expensive computations.
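A minimal sketch of the minimal setting is shown below. The random 100,000-row DataFrame and the helper profile_large_dataframe are hypothetical stand-ins for a real large dataset; the minimal=True argument is the ProfileReport option described above.

```python
import numpy as np
import pandas as pd

# Hypothetical "large" dataset: 100,000 rows of random numbers.
rng = np.random.default_rng(seed=0)
large_df = pd.DataFrame(rng.normal(size=(100_000, 5)), columns=list("abcde"))


def profile_large_dataframe(frame):
    """Return a report with the expensive computations switched off
    (hypothetical helper)."""
    # Imported here so the data setup above runs even where
    # pandas-profiling is not installed.
    from pandas_profiling import ProfileReport

    # minimal=True turns off the costly parts of the report.
    return ProfileReport(frame, minimal=True)


# Typical use (commented out because it needs pandas-profiling installed):
# profile = profile_large_dataframe(large_df)
# profile.to_file("minimal_report.html")
```

A reasonable workflow is to start with the minimal report and rerun the full report only if you need the parts that were skipped.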

Follow me on GitHub: https://github.com/sercangul

Follow me for more information on Python, statistics, and machine learning!

Sercan Gul | Data Scientist | DataScientistTX
Nerd For Tech

Senior Data Scientist @ Pioneer | Ph.D Engineering & MS Statistics | UT Austin