A Customer Segmentation Report — DSND Capstone Project
Project Overview
The purpose of this project is to identify possible new customers for a German mail-order company, Bertelsmann Arvato. The identification of these customers is done using a supervised machine learning model, which takes into account the general demographic characteristics of the German population, as well as the results of the segmentation of these individuals through a clustering model.
The data used belongs to Arvato Financial Solutions and is used only for the purposes of this project. Different clustering and classification models were analyzed during the development of this project. The solution found is documented in this post, and the files used are available in the project repository on GitHub (https://github.com/danielabrasilva/customer_segmentation_classifier).
The repository has the following files necessary for the execution of the project:
process_data.py
clustering.py
classifier.py
In process_data.py the data is processed and cleaned for use by the clustering model in clustering.py; the trained unsupervised model is then used to group the training data for the classification algorithm in classifier.py.
Project Statement
This project is divided into two stages:
I. Customer segmentation: At this stage, we have the azdias.csv dataset for the general population and the customers.csv dataset for the company's customers. Both sets contain columns describing the demographic characteristics of individuals. The idea here is to combine these sets and use an unsupervised learning model to segment the individuals into different groups, thereby identifying the groups with the highest concentration of customers. For this, I tested the K-means, Agglomerative Clustering, DBSCAN and OPTICS algorithms, all from the scikit-learn library. As the execution time of K-means was much shorter than that of the others, I opted for this solution, which also made it easier to find the ideal number of clusters.
II. Classification of the population into client or non-client: In this step, we use a dataset to train a supervised learning model that classifies whether an individual is a potential client or not. Before training, the set is cleaned and segmented into groups according to the clustering model from the previous step. Gradient Boosted Trees, Support Vector Machine, and Logistic Regression algorithms were tested, and the one that showed the best performance according to the chosen metric was Gradient Boosted Trees.
Metrics
To evaluate the clustering performance, I used the sum of squared distances from each data point to its cluster's centroid, i.e., the Within-Cluster Sum of Squares (WCSS). Another option would be the silhouette coefficient, but besides the long execution time of the function that calculates it, this metric does not work well for clusters with complex shapes (Reference 1).
To evaluate the classification algorithms, I used the ROC curve to select the most appropriate solution, since the target variable of the training set is highly imbalanced and accuracy is therefore not an appropriate metric.
Methodology
Data Wrangling
The extraction, transformation and loading of the data — ETL Pipeline — is performed by executing the file process_data.py. This file also contains the “Data Wrangling” step, explained below:
There are four data files associated with this project:
- azdias.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- customers.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- mailout_train.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- mailout_test.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).
For the segmentation step, the dataset total_df was created by concatenating azdias.csv and customers.csv. This combined dataset is then treated and cleaned so that it contains no missing or inconsistent values. The following strategies were applied:
- Columns and rows with more than 70% missing values were removed, as well as columns with less than 10% variability.
- The missing values of the remaining columns were imputed with the most frequent value of each column, and for each column with more than 10% missing values a second column was created to indicate which observations were originally missing (see the sketch after this list).
import numpy as np

def drop_na_rows(df, na_perct=.9):
    '''Drop rows that have fewer than na_perct of their values present
    (i.e. more than 1 - na_perct missing) and return the new DataFrame.
    inputs
        df (pandas.DataFrame): the DataFrame
        na_perct (float): minimum fraction of non-missing values a row must keep
    output
        new_df (pandas.DataFrame): the DataFrame after rows are dropped.
    '''
    # dropna keeps rows with at least `thresh` non-missing values
    thresh = np.ceil(na_perct * df.shape[1])
    new_df = df.dropna(axis=0, thresh=thresh)
    return new_df

def drop_invariability(df, na_perct=0.9):
    '''Drop columns where a single value accounts for more than na_perct
    of the observations and return the new DataFrame.
    '''
    cols = df.columns
    new_df = df
    for col in cols:
        # relative frequency of each value in the column
        aux = df[col].value_counts() / df[col].value_counts().sum()
        if np.sum(aux >= na_perct) > 0:
            new_df = new_df.drop(labels=col, axis=1)
    return new_df

def impute_values(df, cols_with_missing):
    '''Return a new dataframe with imputed values in numeric and categorical columns.
    inputs:
        df (pandas.DataFrame): the DataFrame
    output:
        new_df (pandas.DataFrame): imputed DataFrame.
    '''
    (...)
- The categorical columns with more than two distinct values were removed to simplify the learning models, and the remaining (binary) categorical columns were transformed into dummy variables.
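Since the body of impute_values is omitted above, here is a minimal, illustrative sketch of how the two strategies just described (most-frequent imputation with missing-value indicator columns, and dummy encoding of the remaining binary categoricals) could look. The function names and the 10% threshold are assumptions of this sketch, not the project's exact code.
import pandas as pd

def impute_with_indicators(df, indicator_threshold=0.10):
    '''Sketch: fill missing values with each column's most frequent value and,
    for columns with more than indicator_threshold missing, add a flag column.'''
    new_df = df.copy()
    for col in df.columns:
        missing_frac = new_df[col].isna().mean()
        if missing_frac == 0:
            continue
        if missing_frac > indicator_threshold:
            # 1 marks observations that were originally missing in this column
            new_df[col + '_was_missing'] = new_df[col].isna().astype(int)
        # impute with the most frequent value of the column
        new_df[col] = new_df[col].fillna(new_df[col].mode().iloc[0])
    return new_df

def encode_binary_categoricals(df):
    '''Sketch: drop categorical columns with more than 2 distinct values and
    one-hot encode the remaining (binary) ones.'''
    cat_cols = df.select_dtypes(include='object').columns
    multi_valued = [c for c in cat_cols if df[c].nunique() > 2]
    return pd.get_dummies(df.drop(columns=multi_valued), drop_first=True)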
Data Preprocessing
Some algorithms, such as neural networks and SVMs, are very sensitive to the scale of the data. Therefore, a common practice is to rescale the features so that the data representation is better suited to these solutions [Reference 1].
Here I used scikit-learn's RobustScaler, which removes the median and scales the data according to the interquartile range, so that all features end up with comparable magnitudes while the influence of outliers is reduced. This preprocessing is applied both before clustering and before classification.
Customer Segmentation: Implementation
The clean dataset has 925791 people (rows) x 335 features (columns). I used the K-means clustering algorithm from the scikit-learn library (sklearn.cluster.KMeans) with different numbers of clusters and identified the best result with the Elbow method, i.e., by analyzing the sum of squared distances of samples to their closest cluster center (Within-Cluster Sum of Squares) for each number of clusters tried. The chosen value was n_clusters = 6.
from sklearn.preprocessing import RobustScaler
from sklearn.cluster import KMeans

# Preprocessing
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# KMeans clustering
kmeans = KMeans(n_clusters=6, random_state=42, max_iter=700)
kmeans.fit(X_scaled)
clusters = kmeans.predict(X_scaled)
X['cluster'] = clusters
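To illustrate the Elbow analysis used to choose n_clusters = 6, here is a minimal sketch that computes the WCSS (the KMeans inertia_ attribute) for a range of cluster counts; the range of k values shown is an assumption of the sketch.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sketch: WCSS (inertia_) for several cluster counts, using X_scaled from above.
wcss = []
k_values = range(2, 13)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    wcss.append(km.inertia_)  # sum of squared distances to the closest centroid

plt.plot(k_values, wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.show()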
To better understand the concentration of individuals in each group, I calculated the percentage of people in each cluster for the customer population and for the general population and took the difference between the two. The result is shown in the table below, where a green bar indicates an increase in customer concentration and a red bar a decrease for each cluster.
The results above show that clusters 1 and 5 have an increased customer concentration, while for the other clusters the concentration is lower. Thus, a marketing campaign targeting individuals in these two clusters is more likely to return new customers than one targeting the other groups.
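A minimal sketch of how these per-cluster concentration differences could be computed: it assumes the combined DataFrame X carries the 'cluster' column assigned above plus a hypothetical boolean 'is_customer' flag marking the rows that came from customers.csv.
# Sketch: per-cluster share (%) of customers vs. the general population.
customer_share = X.loc[X['is_customer'], 'cluster'].value_counts(normalize=True) * 100
population_share = X.loc[~X['is_customer'], 'cluster'].value_counts(normalize=True) * 100

# Positive values mean the cluster is over-represented among customers.
concentration_diff = (customer_share - population_share).sort_index()
print(concentration_diff)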
The K-means algorithm was chosen due to its simplicity and low complexity. More complex algorithms, such as Agglomerative Clustering, DBSCAN and OPTICS, also available in the scikit-learn library, could give better results, but these solutions consume much more execution time, so I opted for the simplest and fastest solution.
— These results are obtained by running the clustering.py file.
Data Exploration
Analysis of Customers
With the processed and grouped data in mind, I decided to analyze the variables CUSTOMER_GROUP, ONLINE_PURCHASE and PRODUCT_GROUP belonging to individuals who are already Arvato customers (those in the customers.csv file).
Univariate Exploration
The percentages of customers in each group were:
# CUSTOMER GROUP:
- Multi-buyer: 69.0 %
- Single-buyer: 31.0 %
# ONLINE PURCHASE:
- Non-online purchase: 91.0 %
- Online purchase: 9.0 %
# PRODUCT GROUP
- COSMETIC_AND_FOOD group: 52.6 %
- FOOD: 24.7 %
- COSMETIC: 22.7 %
Bivariate Exploration
- CUSTOMER_GROUP vs ONLINE_PURCHASE: Most customers are multi-buyers who do not buy online.
- CUSTOMER_GROUP vs PRODUCT_GROUP: Most customers are multi-buyers who buy cosmetics and food.
- PRODUCT_GROUP vs ONLINE_PURCHASE
- Online purchase: Most customers buy cosmetics & food.
- Non-online purchase: Customers are more evenly distributed across product groups.
Customer Classification: Implementation
The advantage of using the K-means algorithm is that, once trained, it can predict clusters for new samples. Thus, I used the model trained in clustering.py to assign groups to the new individuals in the mailout_train.csv dataset, which were then used to train the classification algorithm. This new set represents the target individuals of the company’s marketing campaign, and with clustering we gain the new feature “cluster”.
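A minimal sketch of this step, assuming the mailout data has already been cleaned into a hypothetical DataFrame mailout_df with the same columns used for clustering, and that the fitted scaler and kmeans objects from clustering.py are available:
# Sketch: assign clusters to the campaign individuals with the already-fitted models.
mailout_scaled = scaler.transform(mailout_df)            # reuse the fitted RobustScaler
mailout_df['cluster'] = kmeans.predict(mailout_scaled)   # new 'cluster' feature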
The mailout_train data has almost 43 000 rows and includes a column, RESPONSE, that states whether or not a person became a customer of the company following the campaign. In this part, I need to create predictions on the TEST partition (mailout_test.csv), where the RESPONSE column has been withheld.
To train the classifier (in the classifier.py file), I used the same columns that were used to train the clustering algorithm, filled in the missing values with the most frequent value of each column, and split the set into training and test samples for later performance evaluation.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
As the target variable of the training set is highly imbalanced, Gradient Boosting or Support Vector Machine algorithms can bring better results. These algorithms, along with Logistic Regression, were used to train models and determine which one performs best according to the ROC curve [References 2, 3].
Below is the implementation of each of these algorithms and their respective ROC curves.
from xgboost import XGBClassifier
xgb = XGBClassifier(n_estimators=200, learning_rate=0.5, reg_alpha=0.5, max_depth=3,
                    objective='binary:logistic', eval_metric='logloss',
                    use_label_encoder=False)
xgb.fit(X_train, y_train)

from sklearn.svm import SVC
svm = SVC(kernel='poly', class_weight='balanced', random_state=42, C=3,
          probability=True).fit(X_train, y_train)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.1, class_weight='balanced',
                            solver='liblinear').fit(X_train, y_train)
Although Gradient Boosting can be implemented with the scikit-learn library (sklearn.ensemble.GradientBoostingClassifier), I decided to use the implementation from xgboost because it is faster. The chosen parameter values were set to reduce overfitting, since higher values of n_estimators, max_depth and learning_rate increase the risk of overfitting.
The same reasoning applies to the parameter C of the other algorithms: increasing it increases the risk of overfitting. The class_weight parameter addresses the imbalance between the binary classes; the 'balanced' option compensates for the imbalance between 0s and 1s in the training set. The remaining parameters (kernel and solver) were chosen because they produced better results in the simulations performed.
The ROC curve is a probability curve, and the area under that curve (AUC) represents the degree of separability, that is, how well the model can distinguish one class from another. A high AUC value indicates that the model is good at predicting 0s as 0s and 1s as 1s. The AUC values for each tested algorithm were:
No Skill: ROC AUC=0.500
XGB: ROC AUC=0.721
SVM: ROC AUC=0.604
Logreg: ROC AUC=0.660
An AUC of 0.5 indicates a model that is unable to differentiate the classes; using no technique at all, just a random class assignment, would produce this value. Of the tested algorithms, Gradient Boosting was the one with the highest AUC (0.721), which means it distinguishes most correctly between 0s and 1s.
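A minimal sketch of how these AUC values could be computed on the held-out test split, using the models fitted above; the no-skill baseline simply assigns the same score to every sample.
from sklearn.metrics import roc_auc_score

# Sketch: ROC AUC on the held-out test set for each fitted model.
for name, model in [('XGB', xgb), ('SVM', svm), ('Logreg', logreg)]:
    probas = model.predict_proba(X_test)[:, 1]   # probability of class 1
    print('%s: ROC AUC=%.3f' % (name, roc_auc_score(y_test, probas)))

# No-skill baseline: a constant score for every sample yields AUC = 0.5.
print('No Skill: ROC AUC=%.3f' % roc_auc_score(y_test, [0.0] * len(y_test)))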
Model Evaluation and Validation
To evaluate the Gradient Boosted model, I used the test set and compared the precision, recall, f1-score and accuracy metrics with those of the training set. The results obtained are shown below:
- Training set: As expected, the measurements in this set were close to 1.
- Test set: The precision, recall and f1-score are approximately 1 for class 0, but for class 1 these measures are 0.
Therefore, even though Gradient Boosting is the best of the three solutions tested, its performance is still not sufficient, considering that the algorithm tends to classify everything as class 0.
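A minimal sketch of this per-class evaluation, assuming the fitted xgb model and the train/test split defined earlier:
from sklearn.metrics import classification_report

# Sketch: per-class precision, recall and f1-score on both partitions.
print('Training set:')
print(classification_report(y_train, xgb.predict(X_train)))
print('Test set:')
print(classification_report(y_test, xgb.predict(X_test)))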
Conclusions
In this work, I have clustered individuals from a sample of the general population, using demographic characteristics as features. The population was segmented into 6 groups, from which it was noticed that individuals who were already Arvato customers were mostly in groups 1 and 5. Using this segmentation as a feature of the dataset, I trained a classification algorithm to predict potential customers based on demographic characteristics.
When preparing the datasets for the learning models, I tried to keep the largest possible number of features and filled in missing values based on the frequency of the values in each column. As future work, it is possible to test other imputation strategies or other types of preprocessing, such as Principal Component Analysis (PCA) or Non-negative Matrix Factorization (NMF), which can find more meaningful representations of the dataset [Reference 1].
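For instance, a minimal sketch of how PCA might be slotted into the preprocessing, applied to the scaled matrix X_scaled from earlier; the number of components is an arbitrary assumption of the sketch.
from sklearn.decomposition import PCA

# Sketch: project the scaled features onto a smaller number of principal components
# before clustering or classification.
pca = PCA(n_components=50, random_state=42)  # 50 is an arbitrary choice for this sketch
X_reduced = pca.fit_transform(X_scaled)
print('Total explained variance:', pca.explained_variance_ratio_.sum())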
The challenge of this work was to obtain clustering and classification algorithms with good performance. Since clustering is an unsupervised model, it is harder to obtain a metric that indicates how well the adopted solution performs. In addition, robust algorithms like DBSCAN and OPTICS take a long time to run, which led me to exclude these options. The same happened with the choice of the classification algorithm: the execution time was very long, and I ended up choosing the fastest solution, Gradient Boosting.
Another issue is that the target variable is imbalanced, which makes good performance even harder to achieve. There are some approaches that can be useful in this case, for example undersampling and oversampling [Reference 4]. An ideal solution would be to use a technique like Grid Search with cross-validation to select the best parameters, even if this requires a longer execution time.
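A minimal sketch of that idea, reusing the xgboost classifier and training split from above; the parameter grid shown is illustrative.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Sketch: cross-validated grid search scored by ROC AUC.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.1, 0.5],
}
search = GridSearchCV(
    XGBClassifier(objective='binary:logistic', eval_metric='logloss',
                  use_label_encoder=False),
    param_grid, scoring='roc_auc', cv=3, n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)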
References
1. Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media.
4. https://stats.stackexchange.com/questions/131255/class-imbalance-in-supervised-machine-learning