High-Value Customer Prediction

Overview

In this project, I built a complete data analysis and machine learning pipeline on real e-commerce transaction data. The goal was to understand customer behavior and predict high-value customers. The dataset consists of 392,692 purchase records from customers across multiple countries.

Random Forest Accuracy: 90%
Total Transactions: 392,692
PCA Variance Explained: 100%
Top Countries: 4

Data Processing Pipeline

Data Cleaning

Handled missing values, removed negative quantities (returns), eliminated duplicates

Exploratory Analysis

UK leads sales, followed by Netherlands, Ireland, and Germany

Feature Engineering

Created customer-level features: Frequency, Monetary Value, Total Products

PCA Visualization

The first two components explain 100% of the variance (72.9% + 27.1%)
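The data cleaning step in the pipeline above could be sketched as follows. This is a minimal illustration, not the project's actual code: the column names (`InvoiceNo`, `CustomerID`, `Quantity`, `UnitPrice`) and the sample rows are hypothetical, and dropping rows with a missing customer ID is only one possible way of "handling" missing values.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions for illustration.
raw = pd.DataFrame({
    "InvoiceNo": ["1", "1", "2", "3", "3"],
    "CustomerID": [10.0, 10.0, None, 11.0, 11.0],
    "Quantity": [5, 5, 3, -2, 4],
    "UnitPrice": [2.0, 2.0, 1.5, 3.0, 3.0],
})


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a customer, drop returns, drop exact duplicates."""
    out = df.dropna(subset=["CustomerID"])  # assumed handling of missing values
    out = out[out["Quantity"] > 0]          # negative quantities are returns
    out = out.drop_duplicates()             # remove exact duplicate rows
    return out.reset_index(drop=True)


cleaned = clean_transactions(raw)
print(cleaned)
```

The order of the three filters does not change the result here, but dropping missing IDs first keeps the later steps working on a smaller frame.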

Problem Statement

The goal was to build a predictive model to identify high-value customers: those who generate significant revenue and show strong engagement. This helps businesses focus retention efforts and optimize marketing strategies.

Key Questions:

Which customers are most valuable to the business?
What behavioral patterns distinguish high-value customers?
How can we predict customer value early in their lifecycle?

Data & Methodology

The dataset contains 392,692 purchase records. I transformed this into a customer-level dataset with key features:

Customer Features Engineered:

Frequency: Number of purchases per customer
Total Products: Sum of quantities purchased
Monetary Value: Total amount spent
Country: Geographic distribution
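The transformation from purchase records to the customer-level features listed above might look like the sketch below. The column names and sample rows are hypothetical, and counting distinct invoices as "Frequency" is an assumption; the project may have counted rows instead.

```python
import pandas as pd

# Hypothetical cleaned transactions; column names are assumptions.
tx = pd.DataFrame({
    "CustomerID": [10, 10, 10, 11],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "Quantity": [2, 1, 4, 3],
    "UnitPrice": [5.0, 10.0, 2.5, 4.0],
    "Country": ["United Kingdom"] * 3 + ["Germany"],
})

# Amount spent on each transaction line.
tx["LineTotal"] = tx["Quantity"] * tx["UnitPrice"]

# One row per customer with the four engineered features.
customers = tx.groupby("CustomerID").agg(
    Frequency=("InvoiceNo", "nunique"),   # distinct purchases (assumed definition)
    TotalProducts=("Quantity", "sum"),    # sum of quantities purchased
    MonetaryValue=("LineTotal", "sum"),   # total amount spent
    Country=("Country", "first"),         # customer's country
).reset_index()

print(customers)
```

Named aggregation keeps each output column tied to an explicit source column and function, which makes the feature definitions easy to audit.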

PCA Results:

PC1 explains 72.9% of variance
PC2 explains 27.1% of variance
Total: 100% of the variance explained in 2 dimensions

PCA Implementation:

# PCA for dimensionality reduction
from sklearn.decomposition import PCA

# X_scaled: the scaled customer-level feature matrix, assumed to be defined earlier
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"PC1 variance: {pca.explained_variance_ratio_[0]:.1%}")
print(f"PC2 variance: {pca.explained_variance_ratio_[1]:.1%}")
# Output: PC1 variance: 72.9%
# Output: PC2 variance: 27.1%
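The snippet above does not show how `X_scaled` was built. The sketch below assumes it came from `StandardScaler` and uses synthetic data with exactly two feature columns (both assumptions). It also illustrates a general property worth keeping in mind when reading the 100% figure: when the number of components equals the number of input features, the explained variance ratios always sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for two customer features on very different scales.
X = rng.normal(size=(200, 2)) * np.array([1.0, 50.0])

# Standardize so that each feature has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# With as many components as features, PCA is only a rotation:
# no variance is discarded.
pca = PCA(n_components=2).fit(X_scaled)
total = pca.explained_variance_ratio_.sum()

print(f"Total variance explained: {total:.1%}")
```

If the feature matrix has more than two columns, a two-component total of 100% would instead suggest that the remaining columns are linear combinations of the others.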

Key Insights

Geographic Distribution

United Kingdom dominates sales, followed by Netherlands, Ireland, and Germany

Top Products

"Paper Craft Little Birdie" and "Medium Ceramic Top Storage Jar" are bestsellers

Top Predictors of Customer Value:

Purchase Frequency: Number of transactions strongly correlates with value
Average Order Value: Higher spending per transaction
Recency: Time since last purchase indicates engagement

Model Performance

Random Forest: 90%
Bagging Classifier: 89%
Voting Classifier: 88%

Random Forest performed best, at about 90% accuracy in predicting high-value customers, narrowly ahead of the Bagging Classifier (89%) and the Voting Classifier (88%).
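A comparison of the three model types could be set up along the lines of the sketch below. It runs on synthetic data from `make_classification`, not the project's dataset, so its scores will not match the figures reported here. The hyperparameters, the train/test split, and the members of the voting ensemble are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer-level table (not the project's data).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Bagging Classifier": BaggingClassifier(n_estimators=50, random_state=0),
    # Hypothetical ensemble members; the project's choice is not documented.
    "Voting Classifier": VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    ),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.1%}")
```

Accuracy alone can be misleading if high-value customers are a small minority of the data, so precision and recall for the high-value class are worth reporting alongside it.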

Business Impact:

Target high-value customers with personalized marketing
Improve retention for at-risk valuable customers
Optimize resource allocation for customer acquisition

Conclusion

This project demonstrates a complete data science pipeline, from data cleaning and exploratory analysis to predictive modeling. The insights gained can help e-commerce companies better understand their customers and improve marketing strategies.