LTV forecasts: Testing, validation and error analysis
LTV forecasting is a uniquely challenging problem for which there are several competing approaches, and for which it is not straightforward to apply existing machine learning algorithms. In this post I’ll discuss testing and validating LTV models and the challenges that arise in doing so.
The primary method of validating LTV models is back-testing: running models using only data up to a certain date in the past (e.g. a year ago). The LTVs predicted from this back-shifted date can then be compared with the actuals to get the error at both the user level and the aggregate level.
There are a few things to keep in mind when relying on back-tests for validation:
- A large number of back-tests is required before one can get a good idea of performance. Simply running a single test will not produce meaningful output and may lead to incorrect conclusions being made. Instead, back-tests should be conducted periodically over several months to get an idea of error variance and cycles over time.
- Past performance is not always indicative of future performance. If, for example, a model shows a +20% error on iOS cohorts during back-testing, one must not assume this bias is present in today’s forecasts. In fact, for a well-developed model free of bias, one would expect zero correlation between past and present errors, as the errors would ideally be randomly distributed around 0 with no dependence on any cohort or time dimension.
- Data leakage: Care must be taken to avoid “data leakage”, i.e. feeding a model data during the back-test that would not have been available to it on the back-test date. Similarly, if any iteration or parameter tuning is used to improve the model’s performance on a back-test, that back-test is no longer a fair representation of the model’s true performance.
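To make the leakage point concrete, here is a minimal sketch of how a back-test split might be set up. The event layout, the `EVENTS` sample, and the `split_for_backtest` helper are all hypothetical and only illustrate the idea: the model sees events dated on or before the cutoff, while the actuals used for scoring are kept separate.

```python
from datetime import date, timedelta

# Hypothetical revenue log: (user_id, install_date, event_date, revenue).
EVENTS = [
    ("u1", date(2023, 1, 1), date(2023, 1, 2), 4.99),
    ("u1", date(2023, 1, 1), date(2023, 6, 1), 9.99),
    ("u2", date(2023, 1, 5), date(2023, 1, 6), 0.99),
    ("u3", date(2023, 3, 1), date(2023, 3, 2), 1.99),
]


def split_for_backtest(events, cutoff, horizon_days=365):
    """Return (model_inputs, actual_ltv) for a back-test run as of `cutoff`.

    model_inputs holds only events dated on or before the cutoff, so nothing
    from the "future" reaches the model. actual_ltv sums each user's revenue
    within `horizon_days` of install and is used for scoring only.
    """
    model_inputs, actual_ltv = [], {}
    for user_id, install, event_day, revenue in events:
        if install > cutoff:
            continue  # this user had not installed yet on the back-test date
        if event_day <= cutoff:
            model_inputs.append((user_id, install, event_day, revenue))
        if event_day <= install + timedelta(days=horizon_days):
            actual_ltv[user_id] = actual_ltv.get(user_id, 0.0) + revenue
    return model_inputs, actual_ltv


model_inputs, actual_ltv = split_for_backtest(EVENTS, cutoff=date(2023, 2, 1))
print(len(model_inputs), actual_ltv)
```

The same separation applies to tuning: if parameters are adjusted after looking at one cutoff date, a different cutoff date should be used to report performance.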
Autocorrelation of predicted LTV vs. actual LTV errors aggregated by cohort install date
To illustrate point #2, the figure above shows the autocorrelation plot of D365 LTV forecast errors aggregated by cohort install date, for lags of up to 365 days. There does exist some autocorrelation between a cohort’s error and the error of previous cohorts. In an ideal scenario, errors would be truly random noise and no autocorrelation at all would be present. In practice, customer behavior tends to exhibit unpredictable cycles, and so there will exist some temporary biases that will persist until the effect of the recent change is reflected in training data. Hence, in a real-world scenario some autocorrelation will exist. However, as is evident in the chart above, the autocorrelation disappears far before 1 year. In other words, there is no correlation between the error on a cohort and on a cohort 6 or 12 months older. This can limit the ability of back-testing to detect model bias.
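For readers who want to run this check on their own back-test output, the following is a rough sketch of a sample autocorrelation calculation. The error series here is simulated (an AR(1) process chosen purely for illustration, not real forecast errors), and `autocorrelation` is a hypothetical helper name.

```python
import random


def autocorrelation(errors, lag):
    """Sample autocorrelation of a series at the given lag (0 <= lag < len)."""
    n = len(errors)
    mu = sum(errors) / n
    var = sum((e - mu) ** 2 for e in errors)
    cov = sum((errors[t] - mu) * (errors[t + lag] - mu) for t in range(n - lag))
    return cov / var


# Simulated daily cohort errors: an AR(1) process standing in for temporary
# biases that persist for a while and then fade. This is an assumption made
# for the example, not a claim about how real LTV errors behave.
rng = random.Random(7)
errors, prev = [], 0.0
for _ in range(730):
    prev = 0.9 * prev + rng.gauss(0, 1)
    errors.append(prev)

for lag in (1, 7, 30, 180, 365):
    print(lag, round(autocorrelation(errors, lag), 3))
```

In practice, `errors` would be the per-cohort forecast errors ordered by install date, one value per day.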
For this reason, one should also be cautioned against using back-testing as a method of validating future forecasts from a model. Because past errors are not necessarily indicative of future errors, a model performing well on a back-test does not mean that its current forecasts are accurate. A model may be able to predict accurately over several months of cohorts, but fail to pick up on a change in dynamics later and as a result output biased forecasts.
Once back-tests are conducted, the result is a set of user-level forecasts and corresponding actuals. The question then becomes, how does one go about judging the performance of the forecasts, and what should be acceptable success criteria? It’s particularly challenging to answer this question when looking at LTV predictions, because unpredictable variance at the user level makes generating accurate predictions for small cohorts impossible. However, this does not mean that the model doesn’t have value or utility. All models are wrong to some extent, but some are useful.
The two major success criteria for an LTV model are unbiasedness (errors averaging out to 0 over a large sample size) and directional correctness on smaller cohorts. Evaluation metrics should be chosen with these two criteria in mind.
“Traditional” metrics used to evaluate regression models, such as the R² score or mean squared error (MSE), are not good choices for LTV models. The distribution of LTV is highly skewed across a group of users, and outliers/whales are common. Because both metrics are built on squared errors, they can be dominated by a handful of extreme outliers, for which it would never be possible to generate accurate forecasts. The histogram below shows the type of distribution one would normally see for a mobile game:
Furthermore, these metrics are not aligned with the actual use case of an LTV model, which is most often to aggregate user-level predictions into a cohort and look at its mean LTV. When evaluating the performance of the model, the following are all better ways to look at it, and should be used in tandem to get a holistic view of performance:
- Test for ‘unbiasedness’: This is perhaps the most important feature in a good LTV model. When user-level predictions are aggregated over a large cohort (100,000+ users), there should be very little difference between the predicted average LTV and the actual. In other words, the user-level errors should have a mean of 0. Furthermore, this unbiasedness should also exist within subsets of the data, for example on iOS users specifically, or users acquired through Facebook specifically.
- Bootstrapping: Taking repeated samples of a given size (e.g. 1,000 users) from the overall validation group and analyzing the mean and variance of the errors. This is a good way to evaluate how error depends on cohort size. Once again, it is important that the distribution of bootstrapped errors is centered near 0, as a test for unbiasedness.
- Rank correlation: As a test for directional correctness on smaller user groups (such as individual campaigns), one can look at the rank correlation of campaign-level predicted LTVs vs. their actual LTVs.
- Bucket analysis: Grouping users into “buckets” based on their predicted future LTV, and making sure that the users in higher-value buckets did indeed have higher actuals on average. This is another way to test for directional correctness. The result of this analysis would look as follows:
| Predicted LTV Bucket | Average Predicted LTV | Average Actual LTV | User Count |
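The checks in the list above can be sketched in a few small functions. Everything here is illustrative: the user data is simulated (a skewed distribution where 5% of users pay), the campaign figures are made up, and the helper names are hypothetical. `bucket_table` returns rows in the same column order as the table above.

```python
import random


def bootstrap_mean_errors(pred, actual, sample_size, n_boot=100, seed=0):
    """Mean (predicted - actual) error of repeated samples drawn with replacement."""
    rng = random.Random(seed)
    errs = [p - a for p, a in zip(pred, actual)]
    return [sum(rng.choices(errs, k=sample_size)) / sample_size for _ in range(n_boot)]


def mean_and_sd(xs):
    m = sum(xs) / len(xs)
    return m, (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5


def _ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks, i = [0.0] * len(values), 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def bucket_table(pred, actual, n_buckets=5):
    """Rows of (bucket, avg predicted, avg actual, user count), lowest bucket first."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    n, rows = len(order), []
    for b in range(n_buckets):
        idx = order[b * n // n_buckets:(b + 1) * n // n_buckets]
        rows.append((b + 1,
                     sum(pred[i] for i in idx) / len(idx),
                     sum(actual[i] for i in idx) / len(idx),
                     len(idx)))
    return rows


# Simulated users: the prediction equals the expected LTV, 5% of users pay.
rng = random.Random(42)
pred, actual = [], []
for _ in range(20000):
    expected = rng.lognormvariate(0, 1)
    pred.append(expected)
    actual.append(expected * 20 * rng.expovariate(1) if rng.random() < 0.05 else 0.0)

# Bootstrapped error spread by cohort size.
spread = {}
for size in (100, 1000, 10000):
    center, spread[size] = mean_and_sd(bootstrap_mean_errors(pred, actual, size))
    print(size, round(center, 3), round(spread[size], 3))

# Rank correlation on made-up campaign-level averages.
campaign_pred = [1.2, 0.4, 2.5, 0.9]
campaign_actual = [0.6, 0.5, 3.1, 0.7]
print(spearman(campaign_pred, campaign_actual))

rows = bucket_table(pred, actual)
for row in rows:
    print(row)
```

With real data, the unbiasedness check would also be repeated on subsets such as iOS users or a single acquisition channel, by filtering `pred` and `actual` before calling the same functions.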
Cohort models and LTV prediction intervals
While back-testing is a useful way to make iterative improvements to user lifetime value models in general, for the reasons noted above it is not sufficient to provide confidence that the current forecasts made by a user LTV model are trustworthy.
Instead, a good strategy is to use separate, cohort-level models to compare with aggregated user-level LTV predictions. Despite being less flexible than user-level predictions, there are a few advantages to using a simpler cohort-level model for validation. First and foremost, it is a good way to check whether the user-level model may be overfitting to some noise in the data, particularly when a large number of app engagement events are used as inputs. These inputs typically add considerable forecast accuracy but also increase the risk of overfitting, and introduce a secondary data source that forecasts now depend upon. So comparing against a cohort-level model that would not be affected by issues with this data is helpful.
Second, cohort-level models can provide prediction intervals in addition to simple point forecasts. Similar to a confidence interval, a prediction interval for a forecast is meant to contain the actual value some percentage of the time (usually 90% or 95%). This helps to not only get an estimate of a cohort’s future LTV, but also to estimate its uncertainty and the risk associated with making decisions based on the forecast.
Generating a prediction interval is much more challenging to do with user-level models, for the following reasons:
- The most powerful supervised ML algorithms (neural networks, random forests, boosted decision trees) do not provide prediction intervals, only point forecasts.
- While aggregating user-level LTV to a predicted cohort LTV is just a straightforward average, aggregating user-level prediction intervals into a cohort interval is not straightforward.
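A small simulation can show why the second point holds. This is only a sketch under strong assumptions: every user shares one hypothetical LTV distribution (5% payers, exponential spend) and users are independent. Averaging the per-user interval bounds gives back the wide user-level interval, whereas the interval for the cohort mean is far narrower.

```python
import random
from statistics import quantiles

rng = random.Random(0)


def draw_user_ltv():
    """Hypothetical user LTV: 95% of users spend nothing, payers average 20."""
    return rng.expovariate(1 / 20) if rng.random() < 0.05 else 0.0


def interval_95(samples):
    """Empirical 2.5% and 97.5% quantiles of a sample."""
    cuts = quantiles(samples, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]


# Interval for a single user. Since all users share one distribution in this
# sketch, averaging the per-user bounds over a cohort returns the same numbers.
user_lo, user_hi = interval_95([draw_user_ltv() for _ in range(20000)])

# Interval for the mean LTV of a 1,000-user cohort, simulated directly.
cohort_size = 1000
cohort_means = [
    sum(draw_user_ltv() for _ in range(cohort_size)) / cohort_size
    for _ in range(500)
]
sim_lo, sim_hi = interval_95(cohort_means)

print("averaged user-level bounds:", (round(user_lo, 2), round(user_hi, 2)))
print("simulated cohort interval: ", (round(sim_lo, 2), round(sim_hi, 2)))
```

The independence assumption is doing a lot of work here. When errors are correlated across users, for example because behavior shifts for a whole cohort at once, the cohort interval does not shrink this quickly.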
For certain types of forecasting applications, prediction intervals tend to be too narrow in practice and do not cover the full range of uncertainty in a forecast. Indeed, when we have tested prediction interval methods based upon well-established methodology for LTV predictions, we’ve found that the empirical coverage rate of “95%” intervals is closer to 50-60%. To understand why this is the case, one must consider the breakdown of different sources of error in an LTV forecast:
- Unpredictable variance in user behavior: this is inevitable, especially for mobile games where the majority of revenue can come from just 1% or 2% of users in a cohort. However, for an unbiased model it should average out as the cohort size grows.
- Uncertainty in model parameters or structure: regardless of the method used, any model trained on a finite sample of data is going to have uncertainty in its structure. This becomes less of an issue as training set size increases.
- Fundamental changes in user behavior: ML models that are trained on a set of data make the assumption of a static underlying stochastic process. In reality, user behavior is dynamic and patterns that exist between early behavior and LTV are changing all the time. This is called concept drift and is one of the biggest challenges to address in LTV forecasting, because of the high frequency of product updates.
- One-time, unforeseen events: for example, the impact of a competing product’s release on retention and revenue. Such effects can never be accounted for by a model.
Existing methods for prediction intervals quantify uncertainty based on estimates of the level of unpredictable noise in the data (#1), and sometimes also consider the uncertainty in estimated model parameters (#2; Bayesian approaches in particular do this in a very elegant manner). However, the last two sources are difficult to quantify in general, and thus difficult to bake into a prediction interval. The methodology for generating the prediction interval follows the same assumptions about data structure as the model itself, and so any attempt from the model’s point of view to quantify uncertainty is going to be inherently too optimistic. That said, prediction intervals still add an extra layer of insight beyond point forecasts, as long as one understands they are not fully encompassing.
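Empirical coverage is simple to measure once back-tested intervals and actuals are available. The sketch below defines a hypothetical `empirical_coverage` helper and then uses a toy simulation (all numbers are assumptions, not our data) to show how an interval calibrated on noise alone loses coverage once an unmodeled drift term is added to the actuals.

```python
import random


def empirical_coverage(intervals, actuals):
    """Share of actuals that fall inside their (lower, upper) interval."""
    hits = sum(1 for (lo, hi), a in zip(intervals, actuals) if lo <= a <= hi)
    return hits / len(actuals)


def simulate_coverage(drift_sd, n=5000, noise_sd=1.0, seed=1):
    """Coverage of a nominal 95% interval that only accounts for noise.

    Each actual is forecast + drift + noise. The interval is built from the
    noise level alone, so any drift (drift_sd > 0) is invisible to it.
    """
    rng = random.Random(seed)
    forecast = 10.0
    half_width = 1.96 * noise_sd
    intervals, actuals = [], []
    for _ in range(n):
        drift = rng.gauss(0, drift_sd)
        actuals.append(forecast + drift + rng.gauss(0, noise_sd))
        intervals.append((forecast - half_width, forecast + half_width))
    return empirical_coverage(intervals, actuals)


coverage_no_drift = simulate_coverage(drift_sd=0.0)
coverage_with_drift = simulate_coverage(drift_sd=2.0)
print(coverage_no_drift, coverage_with_drift)
```

The drift size used here is arbitrary; the point is only that coverage falls below the nominal level whenever a source of error is left out of the interval.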
AlgoLift’s validation approach
Given the pros and cons of both methods discussed above, AlgoLift uses a combined approach to validate the LTV predictions driving our automation platform. Our validation procedure consists of:
- Back-testing user-level models to identify potential bias and optimize our methodology
- Back-testing cohort models to ensure that they are a good point of reference for validation
- Comparing our aggregated user-level LTVs to cohort models that have been validated historically, and investigating when the aggregated user-level LTV does not fall within the prediction interval of the cohort model
- Comparing subsets of our user-level LTV predictions from our proprietary models to well-established industry standards, such as the Pareto/NBD model or its variants.
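As a rough illustration of the third step, the comparison against a cohort model’s prediction interval could look something like the following. This is not AlgoLift’s actual implementation; the function name, the data layout, and the numbers are all hypothetical.

```python
def needs_investigation(user_level_predictions, cohort_interval):
    """Return (aggregated LTV, flag) for one cohort.

    The flag is True when the mean of the user-level predictions falls
    outside the cohort model's (lower, upper) prediction interval.
    """
    aggregated = sum(user_level_predictions) / len(user_level_predictions)
    lower, upper = cohort_interval
    return aggregated, not (lower <= aggregated <= upper)


# Hypothetical cohorts: install date -> (user-level predictions, cohort interval).
cohorts = {
    "2023-01-01": ([1.0, 0.0, 5.0, 2.0], (1.5, 2.5)),
    "2023-01-02": ([0.0, 0.5, 0.5, 1.0], (1.0, 2.0)),
}

flags = {}
for install_date, (preds, interval) in cohorts.items():
    flags[install_date] = needs_investigation(preds, interval)
    print(install_date, flags[install_date])
```

A flagged cohort is not necessarily a wrong forecast. It is a prompt to look at which of the two models is closer to reality.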