1. Introduction
This project is a real-life data science task provided by partners at Bertelsmann Arvato Analytics, and it also serves as the Capstone project for the Udacity Data Scientist Nanodegree.
In this project, we analyze demographics data for customers of a mail-order sales company in Germany and compare it against demographics information for the general population. We use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then we apply what we have learned to a third dataset with demographics information for targets of a marketing campaign, and use a model to predict which individuals are most likely to convert into customers of the company.
Problem Statement
The goal of this project is to predict which individuals are most likely to convert into becoming customers for a mail-order sales company in Germany.
The project steps are:
- Data Preprocessing: In this part we preprocess the data for further analysis. This involves converting missing-value codes to NaNs, analyzing missing values per column and per row, and some feature engineering.
- Customer Segmentation: The second part aims to help the company find the right potential target customers in the general-population demographics data more efficiently. This is done by leveraging unsupervised learning techniques (PCA and K-means).
- Supervised Learning Model: In this part we build a model to predict whether or not a person would respond to a marketing campaign.
- Kaggle Competition: As a final part, the best model is submitted to Kaggle to see how it stacks up against models created by other users.
Metrics
For the classification problem, the evaluation metric is AUC, the area under the ROC (Receiver Operating Characteristic) curve, where the ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. This metric is preferred because the training data, which we will discuss in the fourth part, is highly imbalanced.
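As a quick illustration (not taken from the project code), the ROC AUC can be computed with scikit-learn directly from predicted probabilities:

```python
from sklearn.metrics import roc_auc_score

# y_true: ground-truth labels (1 = responded, 0 = did not respond)
# y_score: predicted probability of the positive class
y_true = [0, 0, 0, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.8, 0.4, 0.6]

# AUC = 1.0 here because every positive is ranked above every negative
print(roc_auc_score(y_true, y_score))
```

An AUC of 0.5 corresponds to random ranking, which makes it a sensible yardstick even when roughly 99% of the labels are negative.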
2. Data Exploration And Preprocessing
There are four datasets and two description files associated with this project:
- Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).
Each row of a demographics file represents a person, but it also contains information beyond the individual, including information about their household, building, and neighborhood.
Two description files:
- DIAS Information Levels - Attributes 2017.xlsx: a top-level list of attributes and descriptions, organized by informational category.
- DIAS Attributes - Values 2017.xlsx: a detailed mapping of data values for each feature in alphabetical order.
The two description spreadsheets give more information about the features, grouped into categories such as:
- Auto (e.g., share of car owners in certain age groups)
- Transaction activity (e.g., transaction activity MAIL-ORDER, bank, etc.)
- Personality (e.g., affinity indicating in what way the person is culturally/religiously minded)
- Household (e.g., number of members in a household)
- Building (e.g., number of holders of an academic title in the building)
For instance, transaction-activity features mostly start with D19, and car-share features mostly start with KBA. Many of these features are redundant; for example, KBA05 and KBA13 cover many overlapping attributes. I used scatter plots to check feature correlation.
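A quick scatter plot of two presumably overlapping KBA features makes this redundancy visible; the exact feature pair used in the notebook may differ, and `azdias` stands for the loaded general-population DataFrame:

```python
import matplotlib.pyplot as plt

# Visual check of redundancy between two illustrative KBA features
azdias.plot.scatter(x='KBA05_ANTG1', y='KBA13_ANTG1', alpha=0.1)
plt.title('KBA05 vs. KBA13 feature pair')
plt.show()
```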
Data Preprocessing
After working on a few projects, I have to say that I understand what people mean when they say that most of the time (up to 80%) is spent on ETL, EDA, and similar tasks. This project was no exception. The steps are as follows:
Step 1: Create a complete feature dictionary file (feat_info.csv)
Only 314 of the 366 features appear in the given attribute description files, meaning that a lot of features are not documented. So, the first thing I did was to create a complete feature dictionary file containing all 366 features in our datasets.
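A minimal sketch of how the undocumented features can be identified, assuming the file names below and that the attribute sheet exposes an 'Attribute' column after skipping its header row (both assumptions about the local setup):

```python
import pandas as pd

# Read only the header row of the demographics file to get the column names
azdias = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';', nrows=0)

# Attribute descriptions from the provided spreadsheet (layout assumed)
attrs = pd.read_excel('DIAS Attributes - Values 2017.xlsx', skiprows=1)

described = set(attrs['Attribute'].dropna().unique())
undocumented = sorted(set(azdias.columns) - described)
print(f'{len(undocumented)} features have no entry in the description file')

# feat_info.csv is then assembled by combining the documented attributes
# with manually researched entries for the undocumented ones.
```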
Step 2: Convert Unknown and Missing Values to NaN
The third column, unknown, of the feature dictionary file indicates the codes used for missing or unknown data. We convert any data that matches a 'missing' or 'unknown' value code into NaN, as sketched below. The figure below shows how much data is NaN.
We can see that a few rows have a large number of missing values (left figure), and the columns (right figure) also differ widely in how many values they are missing.
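A hedged sketch of the conversion step, assuming feat_info.csv stores the missing-value codes as a comma-separated string in a column named 'unknown' (the exact column names are my assumption):

```python
import numpy as np
import pandas as pd

feat_info = pd.read_csv('feat_info.csv')  # assumed columns: 'attribute', 'unknown'

def parse_codes(raw):
    """Parse a comma-separated code string such as '-1,9' or '-1,XX'."""
    codes = []
    for token in str(raw).split(','):
        token = token.strip()
        codes.append(token)              # keep the string form ('X', 'XX', ...)
        if token.lstrip('-').isdigit():
            codes.append(int(token))     # and the integer form (-1, 0, 9, ...)
    return codes

def convert_unknowns_to_nan(df, feat_info):
    """Replace 'missing/unknown' value codes with NaN, column by column."""
    df = df.copy()
    for _, row in feat_info.iterrows():
        col = row['attribute']
        if col in df.columns and pd.notna(row['unknown']):
            df[col] = df[col].replace(parse_codes(row['unknown']), np.nan)
    return df

azdias_clean = convert_unknowns_to_nan(azdias, feat_info)
```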
Step 3: Remove Unknown and Missing Values
I decided to drop all features with more than 30% NaN values, 58 features in our case. I also dropped all rows with more than 25% NaN values.
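The thresholds above translate into a few lines of pandas (continuing the sketch, with `azdias_clean` as the NaN-converted DataFrame):

```python
# Drop features with more than 30% missing values
col_nan_share = azdias_clean.isna().mean()
azdias_clean = azdias_clean.drop(columns=col_nan_share[col_nan_share > 0.30].index)

# Drop rows with more than 25% missing values
row_nan_share = azdias_clean.isna().mean(axis=1)
azdias_clean = azdias_clean[row_nan_share <= 0.25]
```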
Step 4: Remove duplicated features
Many of the features are redundant. For instance, the suffix _FEIN indicates fine-grained information, while the suffix _GROB indicates coarse-grained information.
I decided to remove the duplicate features 'ALTERSKATEGORIE_FEIN', 'LP_FAMILIE_GROB', 'LP_LEBENSPHASE_FEIN', 'LP_STATUS_GROB', and 'CAMEO_DEU_2015', because some of them are not defined in the description files or their counterpart feature already carries enough information.
Step 5: Re-encode features
Many features need to be re-encoded. For example, the OST_WEST_KZ feature takes the character values O and W. Since the unsupervised learning techniques we will use are only applicable to numerically encoded data, we need to re-encode it.
For skewed numerical continuous features, I applied a natural logarithmic transformation; for the other features, I only applied a simple mapping.
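A minimal sketch of both re-encodings; the list of skewed features shown here is illustrative, not the exact set used in the project:

```python
import numpy as np

# Categorical O/W flag mapped to integers (NaNs stay NaN)
azdias_clean['OST_WEST_KZ'] = azdias_clean['OST_WEST_KZ'].map({'O': 0, 'W': 1})

# Natural-log transform (log1p handles zeros) for skewed count-like features
skewed_cols = ['ANZ_HAUSHALTE_AKTIV', 'ANZ_PERSONEN']  # illustrative selection
for col in skewed_cols:
    azdias_clean[col] = np.log1p(azdias_clean[col])
```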
Step 6: Split Mixed Features
There are four mixed features that need to be split into multiple single-type features. For example:
LP_LEBENSPHASE_GROB (life stage, rough scale) is split into three new features covering age, family, and income (LP_LEBENSPHASE_GROB_SPLIT_AGE, LP_LEBENSPHASE_GROB_SPLIT_FAMILY, LP_LEBENSPHASE_GROB_SPLIT_INCOME).
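A sketch of the split for this feature. The mapping dictionaries below are placeholders; the real mappings follow the value descriptions in 'DIAS Attributes - Values 2017.xlsx':

```python
col = 'LP_LEBENSPHASE_GROB'

# Placeholder mappings from the original code values to the three new dimensions
age_map    = {1: 1, 2: 1, 3: 2}   # e.g. 1 = younger age, 2 = advanced age, ...
family_map = {1: 1, 2: 2, 3: 1}   # e.g. 1 = single, 2 = couple, ...
income_map = {1: 1, 2: 1, 3: 2}   # e.g. 1 = low/average income, 2 = high income, ...

azdias_clean[col + '_SPLIT_AGE'] = azdias_clean[col].map(age_map)
azdias_clean[col + '_SPLIT_FAMILY'] = azdias_clean[col].map(family_map)
azdias_clean[col + '_SPLIT_INCOME'] = azdias_clean[col].map(income_map)
azdias_clean = azdias_clean.drop(columns=[col])
```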
Step 7: Remove outliers
GEBURTSJAHR (year of birth) contains many zero values (left figure). A year of birth of 0 is obviously unreasonable and needs to be converted to NaN; the right figure shows the processed result.
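In code this is a one-line replacement (continuing the sketch):

```python
import numpy as np

# A year of birth of 0 is a placeholder, not a real value: convert it to NaN
azdias_clean['GEBURTSJAHR'] = azdias_clean['GEBURTSJAHR'].replace(0, np.nan)
```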
Step 8: Feature selection by removing collinear variables
Collinear variables are those that are highly correlated with one another. They can decrease the model's ability to learn, decrease model interpretability, and decrease generalization performance on the test set (as described by Will Koehrsen). I used the Pearson correlation coefficient to identify highly correlated features; with a threshold of 0.85, the following 16 features were removed.
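A sketch of the procedure, assuming all remaining features are numeric at this point; it keeps the upper triangle of the absolute correlation matrix and drops one feature from every pair whose correlation exceeds 0.85:

```python
import numpy as np

corr = azdias_clean.corr().abs()

# Upper triangle only, so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
azdias_clean = azdias_clean.drop(columns=to_drop)
print(f'Dropped {len(to_drop)} collinear features')
```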
3. Customer Segmentation
In this section, I use unsupervised learning techniques to characterize the relationship between the existing customers of the mail-order company and the demographics of the general population of Germany, and to find out which parts of the general population are more likely to form the core customer base of the company and which are not. To do this, I use two common unsupervised machine learning methods for customer segmentation: dimensionality reduction and clustering.
PCA
Dimensionality reduction compresses a large number of features into a (usually much) smaller feature set. Principal component analysis (PCA) is a data transformation technique that reduces complexity by discarding secondary information while retaining the main structure of the data. We will use PCA from scikit-learn to reduce the dimensionality of the dataset.
I decided to keep 100 principal components, which explain a cumulative variance of around 82%; I believe this retains the main information in the data.
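A sketch of the PCA step; the median imputation and standard scaling shown here are my assumptions about how the remaining NaNs and differing feature scales were handled before PCA:

```python
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Impute leftover NaNs, standardize, then keep 100 principal components
pca_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    PCA(n_components=100, random_state=42),
)
azdias_pca = pca_pipeline.fit_transform(azdias_clean)

pca = pca_pipeline.named_steps['pca']
print('Cumulative explained variance:', pca.explained_variance_ratio_.sum())
```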
Clustering
Clustering groups data based on similarity. After choosing the principal components, we segment customers by clustering on these components. I use scikit-learn's K-Means to cluster the principal components.
Based on the elbow method, I chose 10 clusters.
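A sketch of the elbow search and the final clustering; `customers_clean` stands for the customer data after the same preprocessing steps and is an assumption about the notebook's variable naming:

```python
from sklearn.cluster import KMeans

# Elbow method: fit K-Means for a range of k and inspect the inertia (SSE) curve
inertias = []
for k in range(2, 21):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(azdias_pca)
    inertias.append(km.inertia_)

# Final model with the chosen number of clusters
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(azdias_pca)
population_labels = kmeans.labels_
customer_labels = kmeans.predict(pca_pipeline.transform(customers_clean))
```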
Clusters 4, 1, and 9 are overrepresented among customers, while clusters 8, 0, and 3 are underrepresented.
So the target customers are upper class (#1: CAMEO_DEUG_2015), wealthy (#1: HH_EINKOMMEN_SCORE), 50–70 years old (#2: PRAEGENDE_JUGENDJAHRE_SPLIT_DECADE), and most likely money savers and investors (#2: FINANZ_SPARER, FINANZ_ANLEGER). They are high earners (#1: LP_STATUS_FEIN) with a low-mobility movement pattern (#1: MOBI_REGIO).
4. Supervised Learning Model and Kaggle Competition
Supervised Learning Model
Now we start to build a predictive model. We use the MAILOUT_TRAIN dataset to train this model and then make predictions on the MAILOUT_TEST dataset.
The MAILOUT_TRAIN dataset is quite imbalanced: out of 42 962 people, only 532 responded to the mailout, a conversion rate of about 1%. We therefore use StratifiedKFold to split the training data, ensuring that the limited positive examples appear in every fold. At the same time, the parameter class_weight='balanced' is set when creating the model.
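A sketch of the cross-validation setup, with a class-weighted logistic regression as a stand-in baseline model (`mailout_train` is assumed to be the preprocessed MAILOUT_TRAIN DataFrame with a RESPONSE column):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = mailout_train.drop(columns=['RESPONSE'])
y = mailout_train['RESPONSE']

# Stratified folds preserve the ~1% positive rate in every split;
# class_weight='balanced' re-weights the rare positive class
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baseline = LogisticRegression(class_weight='balanced', max_iter=1000)

print('CV AUC:', cross_val_score(baseline, X, y, cv=cv, scoring='roc_auc').mean())
```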
There are many machine learning models to choose from, ranging from linear models such as LogisticRegression and tree-based models such as DecisionTreeRegressor to ensembles such as RandomForestRegressor and GradientBoostingClassifier. After comparing the cross-validation performance of several models, I settled on LGBMClassifier, with hyperparameter optimization done via hyperopt.
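A minimal sketch of the LGBMClassifier/hyperopt combination, reusing X, y, and cv from the sketch above; the search space and ranges are illustrative, not the ones actually tuned in the project:

```python
from hyperopt import Trials, fmin, hp, tpe
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

# Illustrative search space
space = {
    'num_leaves': hp.quniform('num_leaves', 16, 64, 1),
    'learning_rate': hp.loguniform('learning_rate', -5, -1),
    'n_estimators': hp.quniform('n_estimators', 100, 500, 50),
}

def objective(params):
    clf = LGBMClassifier(
        num_leaves=int(params['num_leaves']),
        learning_rate=params['learning_rate'],
        n_estimators=int(params['n_estimators']),
        class_weight='balanced',
        random_state=42,
    )
    auc = cross_val_score(clf, X, y, cv=cv, scoring='roc_auc').mean()
    return -auc  # hyperopt minimizes, so negate the AUC

best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)
```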
Top 30 features:
Kaggle Competition
Kaggle Result: the ROC AUC (area under the Receiver Operating Characteristic curve) score is 0.76682.
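For completeness, a sketch of how the submission file can be produced; the column names 'LNR' and 'RESPONSE' reflect the competition's expected format as I understand it, and `mailout_test`/`mailout_test_features` are assumed names for the raw and preprocessed test data:

```python
import pandas as pd
from lightgbm import LGBMClassifier

# Refit the tuned model on the full training data (best comes from the sketch above)
final_clf = LGBMClassifier(
    num_leaves=int(best['num_leaves']),
    learning_rate=best['learning_rate'],
    n_estimators=int(best['n_estimators']),
    class_weight='balanced',
    random_state=42,
).fit(X, y)

# Predicted probability of responding, one row per test individual
probs = final_clf.predict_proba(mailout_test_features)[:, 1]

submission = pd.DataFrame({'LNR': mailout_test['LNR'], 'RESPONSE': probs})
submission.to_csv('kaggle_submission.csv', index=False)
```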
Conclusions
This project aimed to leverage what we have learned throughout the Nanodegree by building a real-life data science project. Here is what we have done:
• Investigated demographics data for the general population of Germany and for customers of a mail-order company.
• Preprocessed the dataset based on feature property and statistics.
• Following this was the unsupervised learning part where PCA and KMeans together were used to perform customer segmentation. With 100 components and 10 clusters, we were able to segment the population in a way that made it possible to identify a couple of clusters that were overrepresented/underrepresented by customers.
• Applied supervised learning to predict whether or not a person became a customer of the company following the campaign.
As a final note, I would like to point out a couple of possible improvements to the project. For example, more or better feature engineering could help. Another approach would be to try ensemble and stacking methods to build a stronger model.
Many thanks to Udacity and Bertelsmann Arvato Analytics for an excellent project.