In this project, I built a complete data analysis and machine learning pipeline on real e-commerce transaction data. The goal was to understand customer behavior and predict high-value customers. The dataset consists of 392,692 purchases from customers across multiple countries.
Handled missing values, removed negative quantities (returns), eliminated duplicates
The UK leads in sales, followed by the Netherlands, Ireland, and Germany
Created customer-level features: Frequency, Monetary Value, Total Products
The first two principal components explain 100% of the variance (72.9% + 27.1%)
The goal was to build a predictive model that identifies high-value customers: those who generate significant revenue and show strong engagement. This helps businesses focus retention efforts and optimize marketing strategies.
The dataset contains 392,692 purchase records. I aggregated these into a customer-level dataset with three key features: Frequency, Monetary Value, and Total Products.
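The cleaning and aggregation steps can be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`InvoiceNo`, `StockCode`, `Quantity`, `UnitPrice`, `CustomerID`), the helper `build_customer_features`, and the exact feature definitions (for example, Total Products as the summed quantity) are assumptions, and the rows are toy data.

```python
import pandas as pd


def build_customer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Clean raw transactions and aggregate them to one row per customer.

    Hypothetical helper: column names and feature definitions are assumed.
    """
    clean = (
        df.dropna(subset=["CustomerID"])  # drop rows with no customer ID
        .query("Quantity > 0")            # remove returns (negative quantities)
        .drop_duplicates()                # remove exact duplicate rows
        .copy()
    )
    clean["Revenue"] = clean["Quantity"] * clean["UnitPrice"]

    return clean.groupby("CustomerID").agg(
        Frequency=("InvoiceNo", "nunique"),  # number of distinct invoices
        Monetary=("Revenue", "sum"),         # total amount spent
        TotalProducts=("Quantity", "sum"),   # total units purchased
    )


# Toy transactions: one duplicate row, one return, one missing customer ID.
raw = pd.DataFrame(
    {
        "InvoiceNo": ["A1", "A1", "A1", "A2", "B1", "B1", "C1"],
        "StockCode": ["P1", "P1", "P2", "P3", "P1", "P2", "P1"],
        "Quantity": [2, 2, 3, 1, -1, 5, 4],
        "UnitPrice": [1.5, 1.5, 2.0, 10.0, 3.0, 3.0, 1.0],
        "CustomerID": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, None],
    }
)

features = build_customer_features(raw)
print(features)
```

Counting distinct invoices rather than rows keeps Frequency from being inflated by multi-item orders; whether that matches the original definition is not stated in the write-up.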
# PCA for dimensionality reduction
from sklearn.decomposition import PCA

# X_scaled is assumed to be the standardized customer-level feature matrix
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Share of total variance captured by each component
print(f"PC1 variance: {pca.explained_variance_ratio_[0]:.1%}")
print(f"PC2 variance: {pca.explained_variance_ratio_[1]:.1%}")
# Output:
# PC1 variance: 72.9%
# PC2 variance: 27.1%
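The PCA snippet above takes `X_scaled` as given. One common way to produce it is scikit-learn's `StandardScaler`; the sketch below assumes that approach and uses synthetic, randomly generated features in place of the real customer table, so its variance ratios will not match the figures reported here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (Frequency, Monetary Value,
# Total Products). The values are random and purely illustrative.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=[1.0, 5.0, 4.0], sigma=0.8, size=(200, 3))

# Standardize each column to zero mean and unit variance so that no single
# feature dominates the components just because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print(f"PC1 variance: {explained[0]:.1%}")
print(f"PC2 variance: {explained[1]:.1%}")
print(f"Total: {explained.sum():.1%}")
```

Without scaling, a feature measured in currency would typically swamp one measured in order counts. Note also that with three input features, two components sum to 100% only when the data effectively lies in two dimensions, for example when one feature is a linear combination of the others or when PCA is run on two features.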
The United Kingdom dominates sales, followed by the Netherlands, Ireland, and Germany
"Paper Craft Little Birdie" and "Medium Ceramic Top Storage Jar" are bestsellers
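Rankings like the ones above usually come from a group-and-sort over the transaction table. A minimal sketch, assuming columns named `Country`, `Description`, `Quantity`, and `UnitPrice`, and using placeholder rows rather than the real data:

```python
import pandas as pd

# Placeholder transactions; names and numbers are illustrative only.
sales = pd.DataFrame(
    {
        "Country": ["Country A", "Country A", "Country B", "Country C", "Country D"],
        "Description": ["Item X", "Item Y", "Item X", "Item Y", "Item X"],
        "Quantity": [10, 4, 3, 2, 1],
        "UnitPrice": [2.0, 5.0, 2.0, 5.0, 2.0],
    }
)
sales["Revenue"] = sales["Quantity"] * sales["UnitPrice"]

# Total revenue per country, highest first.
revenue_by_country = (
    sales.groupby("Country")["Revenue"].sum().sort_values(ascending=False)
)

# Units sold per product, highest first.
units_by_product = (
    sales.groupby("Description")["Quantity"].sum().sort_values(ascending=False)
)

print(revenue_by_country.head())
print(units_by_product.head())
```

Whether "bestseller" means units sold or revenue is a choice the write-up does not spell out; the sketch ranks products by units.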
Of the models tested, Random Forest performed best, reaching ~90% accuracy in predicting high-value customers.
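A Random Forest classifier of this kind can be trained and scored along the following lines. This is a sketch under stated assumptions: the data comes from scikit-learn's `make_classification` rather than the real customer features, the binary label stands in for "high-value", and the hyperparameters and 80/20 split are illustrative choices, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 3 features per customer, binary "high-value" label.
X, y = make_classification(
    n_samples=1000,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    random_state=42,
)

# Hold out 20% of customers for evaluation, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.1%}")

# Relative contribution of each feature to the forest's splits.
print(model.feature_importances_)
```

If the high-value label is derived from one of the input features (for example, a threshold on Monetary Value), that feature should be excluded from the inputs, otherwise the accuracy mostly reflects label leakage. Accuracy can also look strong when high-value customers are rare, so precision and recall are worth reporting alongside it.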
This project demonstrates a complete data science pipeline, from data cleaning and exploratory analysis to predictive modeling. The insights gained can help e-commerce companies better understand their customers and improve marketing strategies.