Motivation
This project tackles several real-world challenges faced by e-commerce websites: predicting customer lifetime value using RFM scores and k-means clustering, segmenting customers to identify the most valuable ones, and predicting the review score a customer will give their order experience based on location, order cost and other factors. It also includes a detailed analysis of how geolocation affects user experience and purchasing behaviour, and more.
Dataset
Brazilian e-commerce public dataset of orders made at Olist Store. The dataset contains information on 100k orders placed between 2016 and 2018 at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and, finally, reviews written by customers. Also included is a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.
The dataset has nine tables, connected by common attributes.
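As a rough sketch of how the tables fit together (the file names here follow the public Kaggle release of this dataset and may need adjusting), the order-level tables can be joined on their shared keys:

```python
import pandas as pd

# File names follow the Kaggle "Brazilian E-Commerce Public Dataset by Olist" release.
orders = pd.read_csv("olist_orders_dataset.csv")
customers = pd.read_csv("olist_customers_dataset.csv")
items = pd.read_csv("olist_order_items_dataset.csv")
reviews = pd.read_csv("olist_order_reviews_dataset.csv")

# The tables share key columns (order_id, customer_id, ...),
# so they can be joined into a single order-level view.
df = (orders
      .merge(customers, on="customer_id", how="left")
      .merge(items, on="order_id", how="left")
      .merge(reviews, on="order_id", how="left"))
print(df.shape)
```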
Project Overview
The project is divided into two parts:
- Analysis and Visualizations: comprehensive analysis, understanding metrics and plotting graphs
- Predictions: customer lifetime value prediction, customer satisfaction prediction and segmentation
Part 1: Analysis and Visualizations
A few instances of the analysis performed are as follows:
- There is an upward trend in orders purchased between January 2017 and July 2018.
- Unfortunately, customers who live in the north and northeast of Brazil bear higher freight costs and have to wait longer to receive their purchases.
- A consequence of late delivery is that customers in the north are likely to give lower review scores.
- Most of the revenue comes from the Southeast and South regions of Brazil. Large cities and capitals, with their bigger populations, also account for a larger share of revenue.
- Monthly retention rate shows whether customers continue to buy products from the website. There is an upward linear trend here, with a few major exceptions (see the retention sketch after this list).
- Monday blues seem to affect purchases: customers are more likely to buy on Monday than on any other day of the week.
- Afternoon boredom makes customers buy more items than at any other time of day.
- The company seems to be doing well on delivery time, since most orders are delivered well before they are expected.
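A minimal sketch of the monthly retention computation mentioned above; this is one common definition (share of last month's buyers who buy again this month) and assumes an `orders` dataframe that already carries `customer_unique_id` and the purchase timestamp:

```python
import pandas as pd

# Hypothetical input: `orders` with customer_unique_id and order_purchase_timestamp.
orders["order_purchase_timestamp"] = pd.to_datetime(orders["order_purchase_timestamp"])
orders["month"] = orders["order_purchase_timestamp"].dt.to_period("M")

# Set of active customers per month
active = orders.groupby("month")["customer_unique_id"].apply(set).sort_index()

# Retention: fraction of last month's customers who purchased again this month
retention = {
    str(cur): len(active.loc[cur] & active.loc[prev]) / len(active.loc[prev])
    for prev, cur in zip(active.index[:-1], active.index[1:])
}
print(retention)
```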
Part 2: Predictions
2.1 Predicting Customer Satisfaction: what score will a customer give for the order
The model estimates, based on data about the product and the order, what review score the customer will give.
The main hypothesis is that the product and how the order was fulfilled influence the customer review score; each feature created is a new hypothesis to test.
Designing an Experiment:
To answer this question, data is collected from each order, from placement up to the delivery phase. With that, the model estimates the score the customer will give at the review phase.
Purchase –> Transport –> Delivery –> Review
(features are extracted from the first three phases; the prediction targets the review phase)
2.1.1 Cleaning and Feature Engineering
The data was a mix of categorical and numerical columns, with null values in 9 of them. To guarantee that the same transformations are applied to new/unseen data, I created custom transformers using scikit-learn's BaseEstimator. Eight new features were also engineered for better results: Working Days Estimated Delivery Time, Actual Delivery Time, Delivery Time, Is Late, Average Product Value, Total Order Value, Order Freight Ratio and Purchase Day of Week.
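A minimal sketch of such a custom transformer, assuming the Olist column names for the order timestamps; it derives two of the engineered features, and the rest follow the same pattern:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DeliveryFeatures(BaseEstimator, TransformerMixin):
    """Illustrative transformer: derives the delivery time and lateness flag
    from the raw order timestamps (column names mirror the Olist orders table)."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the training data

    def transform(self, X):
        X = X.copy()
        delivered = pd.to_datetime(X["order_delivered_customer_date"])
        purchased = pd.to_datetime(X["order_purchase_timestamp"])
        estimated = pd.to_datetime(X["order_estimated_delivery_date"])
        X["delivery_time_days"] = (delivered - purchased).dt.days
        X["is_late"] = (delivered > estimated).astype(int)
        return X
```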
2.1.2 Model Building
- I created two separate pipelines for categorical and numerical data.
- In the categorical pipeline, missing values are imputed with the most frequent value, followed by one-hot encoding.
- In the numerical pipeline, nulls are imputed with the median, followed by standard scaling.
- Since the data was highly imbalanced, I used StratifiedShuffleSplit, which preserves the percentage of samples for each class.
- Among k-Neighbors, Decision Tree, Random Forest, AdaBoost and Gradient Boosting classifiers, Random Forest gave the best accuracy, 62%. I plan to do hyperparameter tuning in the future to improve the score. A sketch of the full setup follows the list.
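A sketch of the two pipelines and the stratified split, with hypothetical column names standing in for the real feature lists:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists; substitute the actual engineered feature names.
num_cols = ["delivery_time_days", "total_order_value", "order_freight_ratio"]
cat_cols = ["customer_state", "purchase_day_of_week"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), num_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

model = Pipeline([("prep", preprocess),
                  ("clf", RandomForestClassifier(n_estimators=100, random_state=42))])

# Stratified split preserves the class proportions of the review scores.
# X, y are assumed to be pandas objects holding the features and review scores.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(X, y))
model.fit(X.iloc[train_idx], y.iloc[train_idx])
print(model.score(X.iloc[test_idx], y.iloc[test_idx]))
```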
2.2 Customer Segmentation
Not all customers of the website are equally important: they have different needs and their own profiles, and our actions should adapt accordingly. Different segmentations serve different goals; to increase retention rate, for example, we can segment by churn probability and act on it. Here I use RFM.
RFM stands for Recency - Frequency - Monetary Value.
There are three segments of customers in this case:
- Low Value: customers who are less active than the others, infrequent buyers/visitors who generate very low, zero, or even negative revenue.
- Mid Value: in the middle of everything. They use the platform often (though not as much as the High Values), are fairly frequent buyers and generate moderate revenue.
- High Value: The group we don’t want to lose. High Revenue, Frequency and low Inactivity.
I calculated Recency, Frequency and Monetary Value (called Revenue from now on) and applied unsupervised machine learning to identify different groups (clusters) for each.
2.2.1 Recency - To calculate recency, I took each customer's most recent purchase date and counted how many days they have been inactive since. With the number of inactive days per customer, I applied K-means clustering to assign each customer a recency score.
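A minimal sketch of the recency scoring, assuming four clusters (the actual cluster count may differ):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Assumes `orders` carries customer_unique_id and a parsed purchase timestamp.
orders["order_purchase_timestamp"] = pd.to_datetime(orders["order_purchase_timestamp"])
snapshot = orders["order_purchase_timestamp"].max()

# Inactive days per customer since their most recent purchase
last_purchase = orders.groupby("customer_unique_id")["order_purchase_timestamp"].max()
rfm = (snapshot - last_purchase).dt.days.rename("recency").reset_index()

# K-means on the one-dimensional recency column; the cluster id becomes the score.
# (Cluster labels are arbitrary, so in practice they are re-ordered so that a
# higher score means a more recent, i.e. better, customer.)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["recency_cluster"] = kmeans.fit_predict(rfm[["recency"]])
```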
The plot shows the distribution of recency across customers, with customer id on the x-axis and recency (days inactive) on the y-axis.
2.2.2 Frequency - To create frequency clusters, I used the total number of orders per customer.
The x-axis of the plot shows the number of times a customer bought a product, and the y-axis the frequency of customers. The plot is right-skewed, with most customers buying products fewer than 5 times and, more specifically, mostly just once.
2.2.3 Revenue -
The plot has the number of products on the x-axis and revenue in Brazilian Reais (BRL) on the y-axis. This plot is also right-skewed, with most of the products having low monetary value and a long tail of high-value ones.
2.2.4 Overall Score -
The scoring above (the overall score is the sum of the three cluster scores) clearly shows that customers with score 8 are our best customers, whereas those with score 0 are the worst.
To keep things simple the scores are renamed:
- 0 to 2: Low Value
- 3 to 4: Mid Value
- 5+: High Value
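Putting the pieces together, a sketch of the overall score and the renaming above, assuming the three cluster scores have already been computed and re-ordered so that higher means better:

```python
# Hypothetical columns: recency_cluster, frequency_cluster, revenue_cluster,
# each ordered so that a higher cluster id means a more valuable customer.
rfm["overall_score"] = (rfm["recency_cluster"]
                        + rfm["frequency_cluster"]
                        + rfm["revenue_cluster"])

def segment(score):
    # Thresholds mirror the renaming described in the text.
    if score <= 2:
        return "Low Value"
    if score <= 4:
        return "Mid Value"
    return "High Value"

rfm["segment"] = rfm["overall_score"].apply(segment)
print(rfm["segment"].value_counts())
```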
You can see how the segments are clearly differentiated from each other in terms of RFM.
We can start taking actions with this segmentation. The main strategies are quite clear:
- High Value: Improve Retention
- Mid Value: Improve Retention + Increase Frequency
- Low Value: Increase Frequency
2.3 Customer Lifetime Value Prediction
We invest in customers (acquisition costs, offline ads, promotions, discounts, etc.) to generate revenue and be profitable. Naturally, these actions make some customers extremely valuable in terms of lifetime value, but there are always customers who pull profitability down. We need to identify these behaviour patterns, segment customers and act accordingly.
To calculate Lifetime Value, we first need to select a time window; it can be anything like 3, 6, 12 or 24 months. With the equation below, we get the Lifetime Value of each customer in that window:
Lifetime Value = Total Gross Revenue − Total Cost
This equation gives us the historical lifetime value. If we see some customers with a very high negative lifetime value historically, it may already be too late to act. At this point, we need to predict the future with machine learning: a simple model that predicts each customer's lifetime value.
RFM scores for each customer ID are used as the feature set. I took 3 months of data, calculated RFM and used it to predict the next 6 months. No cost is specified in the dataset, so Revenue becomes our LTV directly. After RFM scoring, the feature set looks like this:
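A sketch of the windowing described above, with hypothetical cutoff dates inside the 2016-2018 span of the data; it assumes the payments table has been joined in so each row carries `payment_value`, and that `order_purchase_timestamp` is already parsed to datetime:

```python
import pandas as pd

# Hypothetical window boundaries; pick dates that suit the data's 2016-2018 span.
feature_start, feature_end = pd.Timestamp("2017-06-01"), pd.Timestamp("2017-09-01")
label_end = pd.Timestamp("2018-03-01")
ts = orders["order_purchase_timestamp"]

feature_df = orders[(ts >= feature_start) & (ts < feature_end)]  # 3 months -> RFM features
label_df = orders[(ts >= feature_end) & (ts < label_end)]        # next 6 months -> LTV label

# With no cost data, LTV is simply the revenue each customer generates in the label window.
ltv = (label_df.groupby("customer_unique_id")["payment_value"]
       .sum().rename("ltv").reset_index())
```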
Positive correlation is quite visible here. High RFM score means high LTV.
I performed feature engineering, converted categorical columns to numerical ones, checked the correlation of the features against our label (the LTV clusters), and split the feature set and label into X and y.
Then I used XGBoost for the classification. Since there are 3 groups, it is a multiclass classification model; a sketch follows.
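A minimal sketch of that classifier, assuming `X` holds the RFM-based features and `y` the three LTV cluster labels; the hyperparameters are placeholders:

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X: RFM-based feature set; y: LTV cluster label (0 = low, 1 = mid, 2 = high).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# With three classes, the sklearn wrapper selects a multiclass objective itself.
clf = XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=100)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
```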
I am getting a 99% accuracy score on both the train and test sets, which is suspiciously high and needs investigation. The future course of action is:
- Add more features and improve the feature engineering
- Try models other than XGBoost
- Apply hyperparameter tuning to the current model
- Add more data to the model if possible