Exploratory Data Analysis with Python Code Examples

Exploratory data analysis (EDA) is a crucial step in the data science workflow that helps to understand the structure and content of the data. In this article, we’ll provide an introduction to EDA in Python, along with some code examples using the Pandas and Matplotlib libraries.


Importing Data

The first step in EDA is to import the data into Python. We’ll be using a dataset from the UCI Machine Learning Repository that contains information about wine quality. The dataset can be downloaded from the repository's website.

We can use the Pandas library to import the data from a CSV file:

import pandas as pd 
data = pd.read_csv('winequality-red.csv', sep=';')

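This code reads the data from the CSV file and stores it in a Pandas DataFrame object. The sep=';' argument is needed because this file separates its fields with semicolons rather than commas.

Understanding the Data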

Once we have imported the data, the next step is to understand its structure and content. We can start by looking at the first few rows of the data using the head() method:

print(data.head())

This code prints the first five rows of the data to the console. We can also use the info() method to get more information about the data, such as the data types and the number of non-null values for each column:

# info() prints its summary directly and returns None, so no print() is needed
data.info()

This code prints a summary of the data to the console. We can see that the dataset contains 1599 rows and 12 columns, and that every column reports 1599 non-null entries, so this particular file has no missing values.
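Alongside head() and info(), it can help to look at the shape, the column types, and basic summary statistics. The sketch below uses a small made-up frame named sample as a stand-in for the wine data, so the values are purely illustrative; with the real file, the same calls would be made on data.

```python
import pandas as pd

# Toy stand-in for the wine data; these values are made up for illustration.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

# (number of rows, number of columns)
print(sample.shape)

# Data type of each column
print(sample.dtypes)

# Count, mean, standard deviation, min, quartiles, and max per numeric column
stats = sample.describe()
print(stats)
```

The describe() output is itself a DataFrame, so individual values such as stats.loc["mean", "quality"] can be pulled out for further checks.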

Visualizing the Data

EDA also involves visualizing the data to identify patterns and relationships. We can use the Matplotlib library to create a histogram of the wine quality ratings:

import matplotlib.pyplot as plt 
plt.hist(data['quality'])

plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()

This code creates a histogram of the wine quality ratings and displays it on the screen. We can see that the ratings are mostly between 5 and 7, with very few wines receiving a rating of 3 or 8.
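What a histogram suggests can be confirmed with exact counts. The following is a minimal sketch using a made-up Series of ratings (the name ratings and its values are assumptions for illustration); on the real data the same calls would apply to data['quality'].

```python
import pandas as pd

# Made-up ratings standing in for the 'quality' column.
ratings = pd.Series([5, 6, 5, 7, 6, 5, 3, 8], name="quality")

# Number of wines per rating, ordered by rating rather than by frequency
counts = ratings.value_counts().sort_index()
print(counts)

# The same information as a share of all wines
shares = ratings.value_counts(normalize=True).sort_index()
print(shares)
```

Sorting by index keeps the ratings in their natural order, which makes the table easier to compare with the histogram.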

We can also create a scatter plot of two variables to visualize their relationship. For example, we can create a scatter plot of alcohol content versus quality:

plt.scatter(data['alcohol'], data['quality']) 

plt.xlabel('Alcohol')
plt.ylabel('Quality')
plt.show()

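This code creates a scatter plot of alcohol content versus quality and displays it on the screen. We can see that there is a positive relationship: wines with higher alcohol content tend to receive higher quality ratings. Because quality takes only integer values, the points fall into horizontal bands, so the trend is easier to judge from the overall shape than from individual points.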
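A relationship read off a scatter plot can also be put into numbers. The sketch below assumes a small made-up frame named sample with alcohol and quality columns; it computes the Pearson correlation and the mean alcohol content per quality rating. The values are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Made-up values standing in for the 'alcohol' and 'quality' columns.
sample = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "quality": [5, 5, 5, 6, 7, 7],
})

# Pearson correlation coefficient between the two columns (between -1 and 1)
r = sample["alcohol"].corr(sample["quality"])
print(f"correlation: {r:.2f}")

# Average alcohol content for each quality rating
mean_alcohol = sample.groupby("quality")["alcohol"].mean()
print(mean_alcohol)
```

The grouped means are often easier to read than the scatter plot when one of the variables takes only a few distinct values.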

Cleaning the Data

EDA also involves cleaning the data to remove any missing or erroneous values. We can use the Pandas library to remove any rows with missing values:

data.dropna(inplace=True)

This code removes any rows with missing values from the DataFrame.
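Before dropping anything, it is usually worth checking how much would be lost. The sketch below uses a made-up frame named sample (an assumption for illustration) to count missing values per column, inspect the affected rows, and drop them into a new DataFrame so the original stays available for comparison.

```python
import pandas as pd

# Made-up frame with a few deliberately missing values.
sample = pd.DataFrame({
    "alcohol": [9.4, None, 10.5],
    "quality": [5, 6, None],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isna().any(axis=1)]
print(rows_with_missing)

# Without inplace=True, dropna() returns a new DataFrame and leaves sample unchanged
cleaned = sample.dropna()
rows_dropped = len(sample) - len(cleaned)
print(f"dropped {rows_dropped} of {len(sample)} rows")
```

If a large share of rows would be dropped, filling the gaps or dropping only the affected columns may be a better choice than removing the rows.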

Here are a few more examples of EDA in Python:

Example 1: Titanic Dataset

The Titanic dataset is a classic example used for teaching machine learning. The dataset contains information about passengers on the Titanic, including their age, gender, ticket class, and whether or not they survived the disaster.

Here’s how we can perform EDA on the Titanic dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

titanic = pd.read_csv('titanic.csv')

# Display summary statistics of the numerical columns
print(titanic.describe())

# Visualize the distribution of passenger ages, split by survival
sns.histplot(data=titanic, x="Age", hue="Survived", bins=20)
plt.show()  # show this figure so the next plot starts on fresh axes

# Compare the fare distributions of survivors and non-survivors
sns.boxplot(data=titanic, x="Survived", y="Fare")
plt.show()

This code imports the Titanic dataset using Pandas, and then uses the Seaborn library to visualize the distribution of passenger ages and the relationship between fare and survival rate.
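The plots can be complemented with survival rates per group. The sketch below assumes the column names Survived and Pclass, as in the commonly used Kaggle version of the file, and a made-up frame named titanic_sample; the real titanic.csv may use different names.

```python
import pandas as pd

# Made-up rows; the column names 'Survived' and 'Pclass' are assumed.
titanic_sample = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Pclass": [1, 3, 1, 3, 2, 2],
})

# Survived is coded 0/1, so its mean within a group is that group's survival rate
rate_by_class = titanic_sample.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

The same pattern works for any other categorical column, such as the passenger's gender.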

Example 2: Iris Dataset

The Iris dataset is another classic example used for teaching machine learning. The dataset contains information about different species of Iris flowers, including the length and width of their petals and sepals.

Here’s how we can perform EDA on the Iris dataset:

import seaborn as sns
import matplotlib.pyplot as plt

# load_dataset downloads the example data on first use and returns a DataFrame
iris = sns.load_dataset('iris')

# Display summary statistics of the numerical columns
print(iris.describe())

# Visualize the relationship between petal length and width
sns.scatterplot(data=iris, x="petal_length", y="petal_width", hue="species")
plt.show()  # show this figure so the next plot starts on fresh axes

# Visualize the distribution of sepal length for each species
sns.kdeplot(data=iris, x="sepal_length", hue="species", fill=True)
plt.show()

This code uses Seaborn to visualize the relationship between petal length and width, and the distribution of sepal length for each species of Iris flower.
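The per-species distributions can also be summarized numerically. The following is a minimal sketch on a made-up frame named iris_sample that reuses the column names from the example above; the measurements are illustrative only.

```python
import pandas as pd

# Made-up measurements using the same column names as the Seaborn iris data.
iris_sample = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.0, 5.2, 6.0, 6.4],
})

# Mean and standard deviation of sepal length within each species
per_species = iris_sample.groupby("species")["sepal_length"].agg(["mean", "std"])
print(per_species)
```

A table like this gives the center and spread that the density plot shows visually, which is handy when the curves overlap.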

Example 3: Ames Housing Dataset

The Ames Housing dataset is a real-world example used for predicting housing prices. The dataset contains information about various attributes of houses in Ames, Iowa, including their size, location, and condition.

Here’s how we can perform EDA on the Ames Housing dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

housing = pd.read_csv('housing.csv')

# Display summary statistics of the numerical columns
print(housing.describe())

# Visualize the distribution of sale prices
sns.histplot(data=housing, x="SalePrice")
plt.show()  # show this figure so the next plot starts on fresh axes

# Visualize the relationship between living area and sale price
sns.scatterplot(data=housing, x="GrLivArea", y="SalePrice", hue="OverallQual")
plt.show()

This code imports the Ames Housing dataset using Pandas, and then uses Seaborn to visualize the distribution of sale prices and the relationship between living area and sale price for different levels of overall quality.
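Price distributions often have a long right tail, and one way to check this is the skewness statistic. The sketch below uses a made-up Series named prices (an assumption, not the real SalePrice column) and compares the skewness before and after a log transform, a common but optional step.

```python
import numpy as np
import pandas as pd

# Made-up sale prices with one expensive outlier.
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 500_000], name="SalePrice")

# Positive skewness indicates a longer tail on the right-hand side
raw_skew = prices.skew()

# log1p computes log(1 + x), which compresses large values
log_skew = np.log1p(prices).skew()

print(f"skew before: {raw_skew:.2f}, after log transform: {log_skew:.2f}")
```

Whether to model the transformed or the raw price depends on the model being considered, so this is best treated as one diagnostic among several.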

By performing EDA on these datasets, data scientists can gain valuable insights into the data, identify potential issues, and make informed decisions about which machine learning models to use.

Conclusion

EDA is a critical step in the data science workflow that helps to understand the structure and content of the data. In this article, we provided an introduction to EDA in Python, along with some code examples using the Pandas, Matplotlib, and Seaborn libraries. We covered how to import and understand the data, how to visualize the data, and how to clean the data.

By performing EDA, data scientists can identify potential issues with the data that could impact the accuracy of the final results.

Sercan Gul | Data Scientist | DataScientistTX

Senior Data Scientist @ Pioneer | Ph.D Engineering & MS Statistics | UT Austin