
RFM Customer Segmentation using Python

Segmentation of customers in online retail databases using Python, including RFM analysis and clustering.

Give me a clap if you find my article useful.👏👏👏

In this article, I will show how to carry out customer segmentation and RFM analysis on online retail data using Python.

This data set contains all of the transactions recorded between 2009-12-01 and 2011-12-09 for an online retailer based and registered in the UK. The retailer specializes in all-occasion gift items. Most of the retailer’s customers are wholesalers.

1. Reading data and preprocessing
2. Create Recency Frequency Monetary (RFM) table
3. Model — Clustering with K-means algorithm
4. Interpret the result

We need to make sure the data is clean before starting the analysis. As a reminder, we should check for:
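The checklist itself is not reproduced here, so the following is only a sketch of checks that are common for transaction data: missing customer IDs, exact duplicate rows, and negative quantities (typically returns). The column names and the tiny stand-in frame are assumptions, not the real file layout. Negative quantities are only counted, not dropped, because the decision on how to treat returns belongs to the analyst.

```python
import pandas as pd

# Tiny stand-in for the transaction file; column names are assumptions.
raw = pd.DataFrame({
    "Invoice": ["489434", "489434", "489435", "C489449", "489436"],
    "Quantity": [12, 12, 24, -3, 5],
    "Price": [6.95, 6.95, 2.10, 2.95, 1.25],
    "Customer ID": [13085.0, 13085.0, 13085.0, 16321.0, None],
    "InvoiceDate": ["2009-12-01 07:45", "2009-12-01 07:45",
                    "2009-12-01 07:46", "2009-12-01 10:33",
                    "2009-12-01 09:06"],
})


def quality_report(df):
    """Count common problems before deciding how to handle them."""
    return {
        "missing_customer_id": int(df["Customer ID"].isna().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "negative_quantity": int((df["Quantity"] < 0).sum()),
    }


def basic_clean(df):
    """Drop rows without a customer and exact duplicates, then fix dtypes."""
    out = df.dropna(subset=["Customer ID"]).drop_duplicates().copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["Customer ID"] = out["Customer ID"].astype(int)
    return out


report = quality_report(raw)
clean = basic_clean(raw)
print(report)
print(clean)
```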

RFM is a basic customer segmentation technique based on purchasing behavior. The behavior is identified using only three customer data points: recency (how recently the customer last purchased), frequency (how often they purchase), and monetary value (how much they spend).

RFM analysis helps businesses segment their customer base into homogeneous groups so that they can engage each group with a different targeted marketing strategy. Sometimes RFM is also used to identify High-Value Customers (HVCs).

Before we get into the process, here is a brief overview of the steps we will follow.
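As a minimal sketch of how the three data points can be derived with pandas, assume a cleaned transaction table with the hypothetical columns `CustomerID`, `Invoice`, `InvoiceDate`, and `Amount` (quantity times price). The reference date is taken as one day after the last transaction, which is a common convention rather than something the data dictates.

```python
import pandas as pd

# Hypothetical cleaned transactions; column names are assumptions.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "Invoice": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-05", "2011-06-15",
        "2011-10-10", "2011-11-20", "2011-12-08",
    ]),
    "Amount": [50.0, 20.0, 200.0, 10.0, 15.0, 5.0],
})

# Reference date: one day after the last transaction in the data.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    # Days since the customer's most recent purchase.
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Number of distinct invoices.
    Frequency=("Invoice", "nunique"),
    # Total amount spent.
    MonetaryValue=("Amount", "sum"),
)
print(rfm)
```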

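Right now, the dataset consists of recency, frequency, and monetary value columns. But we cannot use it for clustering yet, because the data needs further preprocessing.

K-means works best when the data meet these assumptions: each variable is distributed symmetrically (not skewed), and the variables share the same mean and variance.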

Because of that, we have to manage the skewness of the variables. Here are the visualizations of each variable.

As we can see from above, we have to transform the data, so it has a more symmetrical form. There are some methods that we can use to manage the skewness:

Based on that calculation, we will use the Box-Cox transformation for every variable except MonetaryValue, because that variable includes negative values and Box-Cox requires strictly positive input. For MonetaryValue, we can apply a cube root transformation instead.

The variables do not have the same mean and variance, so we have to normalize them. To do this, we can use the StandardScaler object from the scikit-learn library.
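The two transformations can be sketched as follows, using `scipy.stats.boxcox` and `numpy.cbrt` on a small made-up RFM table (the values are illustrative only). `boxcox` returns both the transformed values and the fitted lambda, and it raises an error on non-positive input, which is why the cube root is used for MonetaryValue.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative RFM values; note the negative MonetaryValue.
rfm = pd.DataFrame({
    "Recency": [4, 177, 1, 30, 365, 12],
    "Frequency": [2, 1, 3, 10, 1, 25],
    "MonetaryValue": [70.0, 200.0, 30.0, -15.0, 5.0, 4200.0],
})

transformed = pd.DataFrame(index=rfm.index)

# Box-Cox needs strictly positive values; it returns (values, lambda).
recency_bc, lam_recency = stats.boxcox(rfm["Recency"].to_numpy(dtype=float))
frequency_bc, lam_frequency = stats.boxcox(rfm["Frequency"].to_numpy(dtype=float))
transformed["Recency"] = recency_bc
transformed["Frequency"] = frequency_bc

# The cube root is defined for negative numbers and keeps their sign.
transformed["MonetaryValue"] = np.cbrt(rfm["MonetaryValue"])

print(transformed)
```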
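A short sketch of the normalization step, assuming the transformed variables sit in a DataFrame called `transformed` (a made-up stand-in here). `StandardScaler` subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and variance 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the transformed RFM table; values are illustrative.
transformed = pd.DataFrame({
    "Recency": [1.2, 4.8, 0.0, 3.1, 5.6, 2.3],
    "Frequency": [0.6, 0.0, 0.9, 1.7, 0.0, 2.2],
    "MonetaryValue": [4.1, 5.8, 3.1, -2.5, 1.7, 16.1],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(transformed),
    index=transformed.index,
    columns=transformed.columns,
)

print(scaled.mean().round(2))        # approximately 0 for every column
print(scaled.std(ddof=0).round(2))   # 1 for every column
```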

Finally, we can do clustering using that data.

To segment the data, we can use the K-means algorithm.

K-means is an unsupervised learning algorithm that uses distance to decide which cluster each data point belongs to. Starting from an initial set of centroids, it calculates the distance from each point to every centroid and assigns the point to the centroid with the smallest distance. It then moves each centroid to the mean of its assigned points and repeats, stopping when the total distance no longer changes significantly from one iteration to the next.

For the clustering to perform well, we have to choose the number of clusters, k, which is the main hyperparameter of K-means. To decide which value of k suits our data, we can use the elbow method.

The x-axis is the value of k, and the y-axis is the sum of squared errors (SSE) for that k. We pick the k at the "elbow" of the curve: the point after which the SSE decreases roughly linearly as k increases. From the above plot, k = 4 is the best choice for our model because the SSE for the following k values tends to follow a linear trend.
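The loop described above (often called Lloyd's algorithm) can be sketched from scratch in NumPy. This is a teaching sketch on toy data, not a replacement for scikit-learn's `KMeans`; the function name, the random initialization, and the stopping tolerance are my own choices.

```python
import numpy as np


def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain K-means: assign points to nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(cents):
        # Distance from every point to every centroid, then nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its points
        # (keep the old centroid if a cluster ended up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # centroids stopped moving
            break

    labels = assign(centroids)
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse


# Two well-separated toy groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids, sse = kmeans(X, k=2)
print(labels, sse)
```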
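A sketch of how the SSE values for the elbow plot can be collected with scikit-learn, where the fitted model exposes the SSE as `inertia_`. The data here are synthetic (four blobs in three dimensions) standing in for the normalized RFM table, so the elbow at k = 4 is built in by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the normalized RFM data: four blobs in 3-D.
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0], [5, 5, 0], [0, 5, 5], [5, 0, 5]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# Fit K-means for a range of k and record the SSE of each fit.
sse = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = model.inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

Plotting the keys of `sse` against its values gives the elbow curve; the drop from k = 3 to k = 4 is large and the drops after that are small.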

From the above table, we can compare the mean values of the recency, frequency, and monetary metrics across the four clusters. It seems that we get a more detailed picture of our customer base using k=4.

Another commonly used way to compare the cluster segments is the snake plot, which is widely used in marketing research to understand customer perceptions. It requires the normalized dataset and the cluster labels. With this plot, we get a good visualization of how the clusters differ from each other.
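A table like this can be produced with a `groupby` on the cluster label. The frame below is a made-up stand-in with two clusters, and the column names are assumptions.

```python
import pandas as pd

# Stand-in RFM table with cluster labels already attached.
rfm = pd.DataFrame({
    "Recency": [2, 4, 200, 300],
    "Frequency": [20, 30, 1, 2],
    "MonetaryValue": [900.0, 1100.0, 30.0, 50.0],
    "Cluster": [0, 0, 1, 1],
})

# Mean of each metric per cluster, plus the cluster size.
summary = rfm.groupby("Cluster").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    MonetaryValue=("MonetaryValue", "mean"),
    Count=("Recency", "count"),
).round(1)

print(summary)
```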
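A sketch of the data preparation behind a snake plot: the normalized table is melted into long format so that each row holds one metric value for one customer, and the plot then draws one line per cluster across the three metrics. The frame and column names are stand-ins, and the plotting helper assumes seaborn is installed; it is defined but not called here.

```python
import pandas as pd

# Stand-in for the normalized RFM data with cluster labels.
scaled = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Recency": [-1.0, -0.8, 1.1, 0.7],
    "Frequency": [1.2, 0.6, -0.9, -0.9],
    "MonetaryValue": [1.0, 0.8, -0.8, -1.0],
    "Cluster": [0, 0, 1, 1],
})

# Long format: one row per (customer, metric).
long_df = pd.melt(
    scaled,
    id_vars=["CustomerID", "Cluster"],
    value_vars=["Recency", "Frequency", "MonetaryValue"],
    var_name="Metric",
    value_name="Value",
)

# Average normalized value per cluster and metric: the heights of the lines.
snake = long_df.groupby(["Cluster", "Metric"])["Value"].mean().unstack("Metric")
print(snake)


def draw_snake_plot(data):
    """Draw one line per cluster across the metrics (requires seaborn)."""
    import seaborn as sns  # imported lazily because plotting is optional
    return sns.lineplot(data=data, x="Metric", y="Value", hue="Cluster")
```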

From the above snake plot, we can see the distribution of recency, frequency, and monetary metric values across the four clusters. The four clusters seem to be separate from each other, which indicates a good heterogeneous mix of clusters.

A scatter plot lets us look at the relationship between the variables, with color used to show a further variable. Remove the outliers from the plot to get a clearer visualization. Those outliers are still taken into consideration in the model development; exclude them only for visualization purposes.

High purchase frequency is found among customers whose most recent purchase was within the last month.

Customers who buy frequently spend less money.

In the above plot, the color specifies the cluster. We can see how the customers are spread along the Recency, Frequency, and Monetary dimensions. Customers in Cluster 1 have made recent purchases with a high frequency, but for lower amounts. The reason could be that these customers frequently purchase accessories that are not very expensive.

From the above analysis, we can see that four clusters fit our data. The heatmap above shows the relative importance of each attribute within each cluster. Monetary Value stands out most strongly in Cluster 3, with a score of 18.21. (The original text calls this a Pearson correlation coefficient, but a correlation cannot exceed 1, so the value is best read as the heatmap's relative importance score.)
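One common way to build such a heatmap is to divide each cluster's average by the population average and subtract 1, so that 0 means "same as the average customer" and large positive or negative values mark attributes that distinguish a cluster. I am assuming this definition here; the article does not spell out the formula. The data are a made-up stand-in.

```python
import pandas as pd

# Stand-in RFM table with cluster labels.
rfm = pd.DataFrame({
    "Recency": [2, 4, 200, 300],
    "Frequency": [20, 30, 1, 2],
    "MonetaryValue": [900.0, 1100.0, 30.0, 50.0],
    "Cluster": [0, 0, 1, 1],
})
metrics = ["Recency", "Frequency", "MonetaryValue"]

cluster_avg = rfm.groupby("Cluster")[metrics].mean()
population_avg = rfm[metrics].mean()

# Relative importance: how far each cluster average sits from the
# population average, as a proportion of the population average.
relative_imp = cluster_avg / population_avg - 1
print(relative_imp.round(2))
```

The resulting table can be passed to a heatmap function such as `seaborn.heatmap` with annotations turned on. Because the score is a ratio rather than a correlation, values well above 1 are possible.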

This step can be done in two ways:

After calculations on the RFM data we can create customer segments that are actionable and easy to understand.
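One widely used way to turn the RFM table into readable segments is quartile scoring: each metric gets a score from 1 to 4, the scores are summed, and the total is mapped to a named segment. The sketch below shows this with `pandas.qcut`; the segment names and score thresholds are hypothetical and would need tuning for a real customer base.

```python
import pandas as pd

# Illustrative RFM table for eight customers.
rfm = pd.DataFrame({
    "Recency": [1, 4, 12, 30, 60, 120, 177, 365],
    "Frequency": [25, 10, 8, 5, 3, 2, 1, 1],
    "MonetaryValue": [4200.0, 900.0, 640.0, 300.0, 120.0, 70.0, 200.0, 5.0],
}, index=pd.Index(range(1, 9), name="CustomerID"))

# Lower recency is better, so the labels run from 4 down to 1.
rfm["R"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)
# Rank first so tied frequencies do not create duplicate bin edges.
rfm["F"] = pd.qcut(rfm["Frequency"].rank(method="first"), q=4,
                   labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["MonetaryValue"], q=4, labels=[1, 2, 3, 4]).astype(int)

rfm["RFM_Score"] = rfm[["R", "F", "M"]].sum(axis=1)


def segment(score):
    """Map a total score (3 to 12) to a segment; thresholds are hypothetical."""
    if score >= 10:
        return "Top"
    if score >= 6:
        return "Middle"
    return "Low"


rfm["Segment"] = rfm["RFM_Score"].apply(segment)
print(rfm[["R", "F", "M", "RFM_Score", "Segment"]])
```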

Tree map of the customer segment and score

Based on the RFM analysis, 8% of customers are loyal customers who tend to spend large amounts when they buy. There are also groups of customers who are already lost and others who are likely to be lost in the near future.

Here’s a handy chart of all the RFM segments, with some actionable tips for each that you can implement straight away!

Possible extensions include adding new variables such as Tenure, the number of days since each customer's first transaction, which tells us how long each customer has been with the system. Another is conducting deeper segmentation based on customers' geographical location and on demographic and psychographic factors.
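Tenure as defined above can be computed in the same way as recency, using each customer's first purchase instead of the last. This is a sketch on made-up transactions with hypothetical column names, with the reference date again set to one day after the last transaction.

```python
import pandas as pd

# Hypothetical transactions; column names are assumptions.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-05", "2011-06-15",
        "2011-10-10", "2011-11-20", "2011-12-08",
    ]),
})

snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

# Days between each customer's first transaction and the reference date.
first_purchase = tx.groupby("CustomerID")["InvoiceDate"].min()
tenure = (snapshot - first_purchase).dt.days.rename("Tenure")
print(tenure)
```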
