Benefit of cohort structure in promo recommendation
By: Aleksey Kocherzhenko, Phil Dong, Shubhaditya Burela
Introduction
Predicting the optimal promotional campaigns for targeting existing customers based on their preferences is a common marketing problem. Personalised recommendation systems usually employ models that rely on user features (such as demographic information) and campaign features (such as the products advertised, the discounts offered, etc.) to make accurate predictions. However, the full set of user and campaign features is almost never available: the model is restricted to the known features, which may not be fully predictive of a user’s preferences. In traditional marketing approaches, users are often split into cohorts based on the perceived similarity of their preferences. Under certain circumstances, knowing which cohort a user belongs to provides additional information about that user. Effectively, if chosen well, a cohort identifier may act as a proxy for a user’s unknown features, so that including it as a model input makes the predictions more accurate. In this study we consider when this happens and suggest an approach to grouping customers into cohorts based on their responses to past promotional campaigns.
Data generation
Consider a simple model case. We generate a synthetic data set where:
- every user is completely described by three user features;
- every campaign is completely described by three campaign features;
- the response of a user to a campaign is binary (the user either responds to the campaign or not).
The goal of the model is to predict the user responses to the campaigns as accurately as possible. The actual response of a user to a campaign in the data set depends on all three user features and all three campaign features. However, only two of the user features (and all three of the campaign features) are assumed to be known and are provided to the model as input. The third user feature is hidden from the model and is treated as if it were unknown.
The generated data set contains 1,000 users that are evenly divided into 10 cohorts. Each user’s 3d feature vector, containing two visible features and one hidden feature, is randomly sampled from a normal distribution centered around the cohort’s mean feature vector. The elements of the mean feature vector for each cohort are randomly selected from a uniform distribution on the interval (-1, 1). The standard deviation is the same across all cohorts, but can be varied separately for the visible features and the hidden feature. We have generated several data sets with varying standard deviations for this study. By construction, the cohort structure affects both the visible and the hidden features of the users.
The generated data set contains 80 campaigns. No cohort structure is assumed for the campaigns. The 3d feature vectors for the campaigns are randomly selected from a uniform distribution on the interval (-1, 1).
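As a sketch, the data-generation step described above might look like the following (NumPy-based; the standard deviations shown are illustrative, since the study varies them across data sets):

```python
import numpy as np

rng = np.random.default_rng(0)

N_USERS, N_COHORTS, N_CAMPAIGNS = 1000, 10, 80
SIGMA_VISIBLE, SIGMA_HIDDEN = 0.5, 0.1  # illustrative values; varied in the study

# Each cohort gets a 3d mean feature vector drawn uniformly from (-1, 1).
cohort_means = rng.uniform(-1, 1, size=(N_COHORTS, 3))

# Users are evenly divided into cohorts; each user's features are normal
# around the cohort mean, with separate spreads for the two visible
# dimensions and the one hidden dimension.
cohort_id = np.repeat(np.arange(N_COHORTS), N_USERS // N_COHORTS)
sigma = np.array([SIGMA_VISIBLE, SIGMA_VISIBLE, SIGMA_HIDDEN])
user_features = cohort_means[cohort_id] + rng.normal(0, sigma, size=(N_USERS, 3))

# Campaigns have no cohort structure: uniform on (-1, 1) in all three dimensions.
campaign_features = rng.uniform(-1, 1, size=(N_CAMPAIGNS, 3))
```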
Figure 1 visualises the user features for all users in two different data sets (with low and high standard deviation), as well as the campaign features for all campaigns in one data set.
We generate the actual response for each of the 1,000 users to each of the 80 campaigns, for a total of 80,000 data points in each generated data set. For each user-campaign pair, we draw the binary response from a Bernoulli distribution whose parameter depends on the inner product between the user feature vector and the campaign feature vector:

P(r_ij = 1) = S(u_i · c_j)

where u_i is the feature vector for user i, c_j is the feature vector for campaign j, and S() is a sigmoid function:

S(x) = 1 / (1 + exp(-x))
Note that the response is a stochastic nonlinear function of all user and campaign features, which limits the possible prediction accuracy.
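In code, the response draw for every user-campaign pair can be sketched as follows (the feature matrices here are simple stand-ins for the ones generated in the previous step):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    """S(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in feature matrices; in the full pipeline these come from the
# data-generation step described above.
user_features = rng.normal(0, 0.5, size=(1000, 3))
campaign_features = rng.uniform(-1, 1, size=(80, 3))

# logits[i, j] = u_i . c_j for every user-campaign pair; each binary
# response is a single Bernoulli draw with success probability S(u_i . c_j).
logits = user_features @ campaign_features.T  # shape (1000, 80)
responses = rng.binomial(1, sigmoid(logits))  # 80,000 binary labels
```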
Cohort ID improves model performance
To investigate how a model can take advantage of cohort structure in the data, we trained an XGBoost model to predict the response from different subsets of features and evaluated its accuracy with 5-fold cross-validation. Specifically, we trained the model with three different sets of inputs:
- Visible Features: only the two visible user features.
- Cohort ID + Visible Features: the two visible user features and the cohort ID as a categorical feature.
- All Features: all three user features (including the hidden user feature).
The three campaign features were always available to the model.
We hypothesised that the variance of user features within a cohort would affect model performance. To study this, we varied the variances of the visible and hidden user features independently during training. The results are summarised in Figure 2.
We observed that, in general, the prediction accuracy decreases as the feature variance increases for all variations of the model. Adding the cohort ID improves accuracy compared to using the visible features alone. This improvement is minimal when the visible feature variance is low; in that case, all three models perform similarly well. This is expected, since the visible features then contain almost all the information about the system: enough to determine both the cohort of the user and the response. However, as the visible feature variance increases, the addition of cohort ID starts to significantly improve model predictions. This improvement is most prominent when the hidden feature variance is low, where adding the cohort ID yields an accuracy comparable to training with all features. Based on these observations, we hypothesised that the cohort ID improves prediction accuracy by relaying information about the hidden user feature.
To test our hypothesis, we focused on a case where the hidden user feature variance was relatively low (0.1). We then investigated how much information the two visible features and the cohort ID each share with the hidden feature. To quantify shared information, we computed the mutual information between features using nonparametric entropy estimation (https://github.com/BiuBiuBiLL/NPEET_LNC). The results are summarised in Figure 3. As we have seen earlier, when the visible feature variance is low, the three variations of the model perform similarly; in this case, the mutual information between the two visible features and the hidden feature is similar to the mutual information between the cohort ID and the hidden feature. In contrast, the visible features contain very little information about the hidden feature when the variance is high, while the cohort ID still contains a significant amount. From these observations, we concluded that including the cohort ID as a model feature improves prediction accuracy by providing information about the hidden user feature.
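The study uses the NPEET_LNC estimator linked above; as a self-contained stand-in, the same kind of comparison can be sketched with scikit-learn’s kNN-based mutual_info_regression (the variances 0.8 and 0.1 are illustrative choices for the high-visible-variance, low-hidden-variance regime):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Users in the regime of interest: high visible variance (0.8 here,
# illustrative) and low hidden variance (0.1).
n_users, n_cohorts = 1000, 10
means = rng.uniform(-1, 1, size=(n_cohorts, 3))
cohort = np.repeat(np.arange(n_cohorts), n_users // n_cohorts)
users = means[cohort] + rng.normal(0, [0.8, 0.8, 0.1], size=(n_users, 3))

hidden = users[:, 2]
X = np.column_stack([users[:, 0], users[:, 1], cohort])

# kNN-based mutual information (in nats) between each candidate input and
# the hidden feature; the cohort ID column is flagged as discrete.
mi = mutual_info_regression(X, hidden, discrete_features=[2], random_state=0)
mi_visible_1, mi_visible_2, mi_cohort = mi
```

With the hidden feature tightly clustered around its cohort mean, the cohort ID pins it down almost completely, while the noisy visible features do not.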
Response history can be used to predict cohorts and improve accuracy
We have shown that including the cohort ID as input can significantly improve the prediction accuracy of the model. However, in real world scenarios, the segmentation of users into cohorts by their preferences may be unknown a priori. Yet, assuming that such cohort structure exists, we might look for a way to determine it and take advantage of it.
A naïve approach would be to cluster the user feature vectors and use the cluster labels as a proxy for the cohort ID. However, if we only use the visible features to perform the clustering, we expect the clustering accuracy to be poor precisely when the visible feature variance is high and the cohort ID could most improve model performance. Because the visible features share very little information with the hidden feature when the variance is high, we cannot expect cohort IDs deduced by clustering on these features to add new information and improve the prediction accuracy. A key insight is that in real-world scenarios we often know some history of previous user responses, and those responses contain information about the hidden features, because the response is presumed to be a function of all user features. Following this idea, we performed clustering in a high-dimensional augmented space, where the binary responses to previous campaigns are treated as additional dimensions and concatenated with the two visible user feature dimensions. As shown in Figure 4, K-Means clustering in the augmented space recovers the real cohort ID accurately, while clustering in the space of visible features alone performs poorly due to the limited information contained in those features.
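A sketch of this comparison, assuming synthetic data like that described earlier (the variance values are illustrative of the high-visible-variance regime; the adjusted Rand index measures agreement with the true cohorts):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic users in the high-visible-variance regime, plus their binary
# responses to 80 past campaigns (generated as in the earlier sections).
n_users, n_cohorts, n_campaigns = 1000, 10, 80
means = rng.uniform(-1, 1, size=(n_cohorts, 3))
cohort = np.repeat(np.arange(n_cohorts), n_users // n_cohorts)
users = means[cohort] + rng.normal(0, [0.8, 0.8, 0.1], size=(n_users, 3))
camps = rng.uniform(-1, 1, size=(n_campaigns, 3))
responses = rng.binomial(1, 1.0 / (1.0 + np.exp(-(users @ camps.T))))

visible = users[:, :2]
augmented = np.hstack([visible, responses])  # 2 visible dims + 80 response dims

# Compare how well K-Means recovers the true cohorts in each space.
ari = {}
for name, X in [("visible only", visible), ("response-augmented", augmented)]:
    labels = KMeans(n_clusters=n_cohorts, n_init=10, random_state=0).fit_predict(X)
    ari[name] = adjusted_rand_score(cohort, labels)
```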
Next, we tested whether using the cohort ID predicted by the clustering algorithm as a model input improves the response prediction accuracy. For our generated data sets, we ran K-Means clustering in the response-augmented space and then used the predicted cohort IDs as input to the XGBoost model (in place of the actual cohort IDs used in the previous section). We performed experiments similar to the ones described earlier, varying the visible and hidden feature variances and benchmarking the results with 5-fold cross-validation. Importantly, a subset of campaigns was held out from the training data set during cross-validation and used for testing only, so that the model could not use responses from the same campaign to predict new data.
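The campaign-level holdout can be implemented with a grouped split, where each row is a user-campaign pair and the campaign index is the group; a sketch using scikit-learn’s GroupKFold (the variable names are ours):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Rows are user-campaign pairs; the campaign index is the grouping key,
# so each campaign's responses land entirely in a single test fold.
n_users, n_campaigns = 100, 20
campaign_idx = np.tile(np.arange(n_campaigns), n_users)  # one entry per row
X_dummy = np.zeros((n_users * n_campaigns, 1))           # features would go here

test_campaign_sets = []
for train_rows, test_rows in GroupKFold(n_splits=5).split(X_dummy, groups=campaign_idx):
    train_camps = set(campaign_idx[train_rows])
    test_camps = set(campaign_idx[test_rows])
    assert train_camps.isdisjoint(test_camps)  # no campaign leaks between folds
    test_campaign_sets.append(test_camps)
```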
The results of our experiment are summarised in Figure 5. The behaviour of the model trained with response-clustered cohort IDs closely mimics that of the model trained with the real cohort ID, shown in Figure 2. Using the response-clustered cohort ID significantly improves the response prediction accuracy when the visible feature variance is high and the hidden feature variance is low. For the case when the hidden feature variance is low (0.1), we computed the mutual information shared between the hidden user feature and the visible user features, the real cohort ID, and the response-clustered cohort ID, similarly to Figure 3. These results are shown in Figure 6. When the visible feature variance is high, the response-clustered cohort ID provides an accuracy improvement similar to that of the real cohort ID. Accordingly, both the predicted and the real cohort ID share a substantial amount of information with the hidden user feature, in contrast to the visible features. Based on these observations, we concluded that using previous responses to deduce the cohort structure can provide a substantial improvement in model prediction accuracy.
Conclusion and Discussion
In this study, we found that when there is a cohort structure in the user base, we can take advantage of the user cohorts and improve response prediction accuracy by including the cohort information as a categorical feature to the model. Furthermore, in real world situations when the cohort information is not known, we can often use the responses of users to previous campaigns to predict the cohort information and achieve improved prediction accuracy. While responses of users to previous campaigns can be directly included as inputs to a model, resulting in similar improvements for the predictions, our clustering approach has advantages in interpretability, generalisability, and efficiency. Identification of an underlying cohort structure allows us to learn more about our users and can make the model easier to interpret. Because the response-clustered cohort IDs are essentially user features, they could potentially be useful for predicting other user behaviour. Furthermore, training the model with a one-dimensional predicted cohort ID is more efficient than using the responses to all previous campaigns. The response-clustered cohort IDs can be calculated once for each user and will provide information about that user’s underlying hidden feature so long as that feature does not change over time.