Demystifying Dimensions: A Guide to Principal Component Analysis (PCA) in Python

Data analysis in the real world often involves datasets with multiple features. While this abundance of information can be valuable, it also presents challenges. High-dimensional data can be difficult to visualize, interpret, and analyze. Enter Principal Component Analysis (PCA), a powerful technique for dimensionality reduction that helps us navigate this complexity.

What is PCA?

In simple terms, PCA helps us identify the most important structure in a dataset by extracting new features, called principal components, that capture the majority of the variance in the original data. These principal components are essentially new axes, ordered by how much of the variation in the data each one explains. By focusing on the leading components (a short from-scratch sketch after the list below illustrates the computation), we can achieve several benefits:

  • Dimensionality reduction: Reduce the number of features without losing significant information, making data processing and visualization easier.
  • Improved performance: Reduce the computational cost of machine learning algorithms.
  • Reduced noise: Eliminate irrelevant and redundant information, leading to more accurate models.
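
To make this concrete, here is a minimal from-scratch sketch of the computation using NumPy on random data. It is illustrative only (the variable names are mine); in practice you would use scikit-learn, as shown in the next section.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))  # 100 samples, 3 features

#1. Center the data (PCA assumes zero-mean features)
Xc = X - X.mean(axis=0)

#2. Covariance matrix of the features
cov = np.cov(Xc, rowvar=False)

#3. Eigendecomposition: eigenvectors are the principal components,
#   eigenvalues are the variances captured along them
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # sort by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

#4. Explained variance ratio and projection onto the top 2 axes
explained_ratio = eigvals / eigvals.sum()
X_proj = Xc @ eigvecs[:, :2]
print(explained_ratio, X_proj.shape)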

Implementing PCA in Python

Python’s scikit-learn library provides a powerful set of tools for implementing PCA. Here’s a basic example (the full source code and dataset are available in my Github repo):

#Load data
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plots below

df = pd.read_csv('wine.data.csv')

#Standard scaling
#PCA is sensitive to the variance of each feature, so the data
#must be scaled/normalized for it to work properly
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

#Define X and y from the dataset
X = df.drop('Class', axis=1)
y = df['Class']

#Transform X using the scaler and rebuild a labeled DataFrame
X = scaler.fit_transform(X)
dfx = pd.DataFrame(data=X, columns=df.drop('Class', axis=1).columns)
#PCA class import and analysis
from sklearn.decomposition import PCA

#n_components (int, float or 'mle', default=None)
#Number of components to keep. If n_components is not set, all components are kept.

pca = PCA(n_components=None)
pca.fit(dfx)  # fit() returns the fitted estimator itself
#Now, let's visualize the PCA results

plt.figure(figsize=(10, 6))
plt.scatter(x=[i + 1 for i in range(len(pca.explained_variance_ratio_))],
            y=pca.explained_variance_ratio_,
            s=200, alpha=0.75, c='orange', edgecolor='k')
plt.grid(True)
plt.title("Explained variance ratio of the \nfitted principal component vector\n", fontsize=25)
plt.xlabel("Principal components", fontsize=15)
plt.xticks([i + 1 for i in range(len(pca.explained_variance_ratio_))], fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel("Explained variance ratio", fontsize=15)
plt.show()
#If we keep 8 principal components instead of the initial 13 features,
#we retain a total explained variance of about 92%.

Result: we reduced the dimensionality from 13 features to 8 principal components while keeping 92% of the variance in the dataset, a reduction of nearly 40%!
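
To check that number and actually perform the reduction, here is a quick sketch continuing from the code above (so dfx and pca are assumed to already exist):

import numpy as np
from sklearn.decomposition import PCA

#Cumulative explained variance across the ordered components
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(cumvar)  # the 8th entry should land around 0.92 for this dataset

#Keep only the first 8 principal components
pca8 = PCA(n_components=8)
X_reduced = pca8.fit_transform(dfx)
print(X_reduced.shape)  # (178, 8) for the UCI wine dataset

#Alternatively, pass a float between 0 and 1 to let scikit-learn
#pick the smallest number of components reaching that variance;
#n_components='mle' instead chooses the dimensionality automatically
pca_auto = PCA(n_components=0.92)
X_auto = pca_auto.fit_transform(dfx)
print(pca_auto.n_components_)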

Beyond the Basics:

  • Visualization: We can plot the data in the space of the leading principal components (for example, a scatter plot of the first two) to understand the relationships between features and identify potential clusters; a short sketch follows this list.
  • Choosing the number of components: We can use the cumulative explained variance ratio (as sketched in the snippet above) to determine the optimal number of principal components to retain, balancing dimensionality reduction against information loss.
  • Applications: PCA has numerous applications across fields, including image and signal processing, machine learning, and anomaly detection.
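
For example, here is a minimal sketch of the visualization idea, projecting the scaled wine data onto the first two components and coloring points by class (again assuming dfx and y from the code above):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

#Project onto the first two principal components
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(dfx)

plt.figure(figsize=(8, 6))
#Color each sample by its wine class to reveal potential clusters
sc = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis',
                 alpha=0.75, edgecolor='k')
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Wine samples projected onto the first two PCs")
plt.colorbar(sc, label="Class")
plt.show()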

Further Exploration:

This article provides a basic introduction to PCA in Python. To delve deeper, the scikit-learn documentation on PCA and the broader decomposition module is a good place to start.

By mastering PCA, you unlock a powerful tool for tackling high-dimensional data and extracting meaningful insights from your analyses. The journey continues!

Sercan Gul | Data Scientist | DataScientistTX

Senior Data Scientist @ Pioneer | Ph.D Engineering & MS Statistics | UT Austin