Reaching your most valuable audiences: Predicting Conversion Intent Part 2
by Lotte Jonkman, on Dec 4, 2020 11:30:14 AM
Blog 2 out of 4: feature selection analysis
- Blog 1: capture the behavioural ‘rules’ that lead to your target
- Blog 2: feature selection analysis
- Blog 3: training/evaluating the model
- Blog 4: deploying the model
We often use three different ways to analyse the features in a dataset:
- Chi-square test
- Mutual Information
- Correlation Matrix
The chi-square test is a statistical test of whether two categorical variables are independent of each other, in this case the feature and the target. We calculate the chi-square statistic for every feature against the target variable and analyse how strong the relationship is. If the relationship is weak, we can consider discarding that feature; if it is strong, the feature is likely to be important.
In this example, I continued with the dataset created in the first blog, where we predict conversion intent based on a set of features. You can see that the strongest relationships are all with features based on previous revenue, along with the customer's city.
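To make the calculation concrete, here is a minimal sketch of the chi-square statistic in plain Python. The feature names (`city`, `device`) and all values are made-up toy data, not the dataset from this series, and the helper `chi_square_statistic` is a hypothetical name used only for illustration.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic for two categorical sequences of equal length."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # count per (feature value, target value) cell
    feature_totals = Counter(feature)
    target_totals = Counter(target)

    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            # Count we would expect in this cell if feature and target were independent.
            expected = f_count * t_count / n
            diff = observed[(f_value, t_value)] - expected
            statistic += diff * diff / expected
    return statistic


# Hypothetical toy data: 1 = converted, 0 = did not convert.
converted = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "city": ["AMS", "AMS", "AMS", "RTM", "RTM", "RTM", "AMS", "RTM"],
    "device": ["mobile", "desktop", "mobile", "desktop",
               "mobile", "desktop", "desktop", "mobile"],
}

# Score every feature against the target and rank from strongest to weakest.
scores = {name: chi_square_statistic(values, converted)
          for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

In this toy data `city` lines up perfectly with the target and `device` not at all, so `city` ranks first. Raw statistics are only directly comparable between features with the same number of categories; otherwise it is safer to compare p-values, which a library routine such as `scipy.stats.chi2_contingency` reports alongside the statistic.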
Mutual information quantifies the amount of information obtained about one variable (in this case, the target) by observing the other variable (the feature). It gives you a score for each feature in your data: the higher the score, the more relevant the feature is to your output variable.
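As a rough illustration of how such a score can be computed for discrete features, the sketch below estimates mutual information from simple counts. The helper `mutual_information`, the feature names, and the values are all hypothetical toy examples, not the data used in this series.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    feature_counts = Counter(feature)
    target_counts = Counter(target)

    mi = 0.0
    for (f_value, t_value), count in joint.items():
        p_joint = count / n
        p_feature = feature_counts[f_value] / n
        p_target = target_counts[t_value] / n
        # Cells that never occur contribute nothing, so we only loop over observed cells.
        mi += p_joint * math.log2(p_joint / (p_feature * p_target))
    return mi


# Hypothetical toy data: 1 = converted, 0 = did not convert.
converted = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "city": ["AMS", "AMS", "AMS", "RTM", "RTM", "RTM", "AMS", "RTM"],
    "device": ["mobile", "desktop", "mobile", "desktop",
               "mobile", "desktop", "desktop", "mobile"],
    "returning_visitor": ["yes", "yes", "yes", "no", "no", "no", "no", "yes"],
}

mi_scores = {name: mutual_information(values, converted)
             for name, values in features.items()}
mi_ranking = sorted(mi_scores, key=mi_scores.get, reverse=True)

for name in mi_ranking:
    print(f"{name}: {mi_scores[name]:.3f} bits")
```

A score of 0 means the feature tells us nothing about the target, while a feature that fully determines a balanced binary target scores 1 bit. Continuous features need a different estimator; scikit-learn's `mutual_info_classif` is one option that handles them.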
Using the same dataset as in the chi-square analysis, we see the following features in the top 5:
A correlation matrix shows the correlations between each feature and the target, as well as between each pair of features. A positive value means that when one feature increases, the other tends to increase as well. A negative value means that when one feature increases, the other tends to decrease. The larger the absolute value, the stronger the relationship.
When we calculate the correlations, we usually put them in a spreadsheet and use conditional formatting to see which features are highly correlated with the target. We also check which features are highly correlated with each other, since we might need only one of them.
As you can see in the image above, the max of the clicks on a call-to-action is highly correlated with the sum of the clicks on a call-to-action, which is logical. We can use the results of the other feature analysis tests to see which of the two performs better, and then decide whether one of them can be left out of the model.
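The same check can be done in code instead of a spreadsheet. The sketch below builds a Pearson correlation matrix in plain Python and lists feature pairs that are highly correlated with each other. The column names, the values, and the 0.9 cut-off are assumptions chosen for illustration, not figures from our dataset.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two numeric sequences of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one list per column, one position per customer.
columns = {
    "cta_clicks_sum": [0, 2, 5, 1, 8, 3],
    "cta_clicks_max": [0, 1, 3, 1, 4, 2],
    "days_since_last_visit": [10, 30, 5, 8, 12, 25],
    "converted": [0, 0, 1, 0, 1, 1],
}

# Full matrix: every column against every other column, including the target.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

THRESHOLD = 0.9  # assumed cut-off for "highly correlated"
features = [name for name in columns if name != "converted"]

# Feature pairs that carry nearly the same information.
redundant_pairs = [(a, b) for a, b in combinations(features, 2)
                   if abs(matrix[a][b]) >= THRESHOLD]

# Features ordered by the strength of their correlation with the target.
by_target = sorted(features, key=lambda f: abs(matrix[f]["converted"]), reverse=True)

print("Redundant pairs:", redundant_pairs)
print("Ranked by correlation with target:", by_target)
```

Here the sum and max of the call-to-action clicks come out as the only redundant pair, mirroring the situation described above. With pandas, `DataFrame.corr()` produces the same kind of pairwise matrix in one call.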
By performing this analysis we can determine which features are useful to add and which are not. This helps us to reduce the set of features. This reduced set of features is not guaranteed to give us the best model yet. But practically, it does reduce our search space for the best model.
In my next blog of this series, I will describe the next step, which is to create models and investigate if leaving out a subset of features is beneficial for the model’s performance.
Crystalloids helps companies improve their customer experiences and build marketing technology. Founded in 2006 in the Netherlands, Crystalloids builds crystal-clear solutions that turn customer data into information and knowledge into wisdom. As a leading Google Cloud Partner, Crystalloids combines experience in software development, data science, and marketing, making it a one-of-a-kind IT company. Using the Agile approach, Crystalloids ensures that use cases show immediate value to its clients and frees up their time to focus more on decision making and less on programming.