Separating theory from application
As it stands, machine learning has a fundamental theoretical problem, which we’ll illustrate with an analogy. Imagine that you want to drink water from a river. The water is heavily polluted from several different sources, and there are a few signs suggesting something is wrong. So, you decide that you need to check whether the water is safe first, and afterwards think about how to remove any potential contaminants. This means not only understanding what pollutants look like, but also coming up with a way to filter all of them.
With some difficulty, you create a sophisticated machine. It teaches itself how to detect any potential signs of a problem, and will warn you about what kind of pollution it finds.
Your machine proves to be great at telling exactly what kind of pollution it spots, especially as it sees more cases over time. However, does that mean it’s covered every type of pollution? And, can it be relied upon to then stop the pollution, without also removing water that’s safe to drink?
Where machine learning falters
Using machine learning in an attempt to filter all kinds of spoofing, rather than one specific method, can lead to some issues. This is because fake users have to be filtered out of a data set that mixes them with real users, with a whole host of unclear edge cases.
Furthermore, fraudsters can farm real device data and spoof legitimate user behavior - including any attributes sent by an SDK. Some fraudsters will make mistakes (such as creating fake user interactions that are easily spotted), but every time they get caught, they learn something new. As such, their next attempt could be much more sophisticated.
An example of the difficulties faced by machine learning starts with a fraudster who uses the real device information of a known user (such as OS version, IDFA and locale settings) to commit fraud. If they spoof an install for an app that was never downloaded on that device, the machine learning algorithm will have a hard time categorizing the fraud correctly when it draws on past data points. Based on historical data, the user is real - so how could it be fraud to the algorithm?
Furthermore, future, real user activity might end up being categorized as fraudulent because of poor spoofing performed with genuine device data. Essentially, not knowing which data points are genuine and which aren’t makes it difficult to train the neural networks. We have already seen fraudsters spoof virtually any request, including a client’s own measurement systems, with “perfect”-looking data. This makes it harder to identify spoofed users even after a longer period of tracking their behavior.
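The problem above can be made concrete with a toy sketch. Everything here is hypothetical for illustration - the device fingerprints, the lookup-based “model,” and the labels are all invented - but it shows why a system that leans on historical device data will happily accept a spoofed install that reuses a real user’s attributes.

```python
# Hypothetical historical records: device fingerprints previously seen
# alongside genuine user activity.
known_real_devices = {
    ("iOS 16.4", "IDFA-AAAA", "en_US"),
    ("Android 13", "GAID-BBBB", "de_DE"),
}

def classify_install(os_version, device_id, locale):
    """Naive history-based check: any install matching a known-real
    device fingerprint is labeled 'real'."""
    if (os_version, device_id, locale) in known_real_devices:
        return "real"
    return "suspicious"

# A fraudster replays the genuine device data of a known user to spoof
# an install for an app that user never actually downloaded:
verdict = classify_install("iOS 16.4", "IDFA-AAAA", "en_US")
print(verdict)  # -> "real": the spoofed install inherits the user's history
```

A real neural network is far more elaborate than a set lookup, but the failure mode is the same: when the features it trusts can be farmed and replayed, the historical record vouches for the fraudster.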
In sum, machine learning doesn’t perform so well when faced with new and unfamiliar scenarios. As we’ll see in the next section, this makes it an unreliable gauge and filtering system when it comes to the real world.
Lost in translation
In order to be used as a basis for rejection, a neural network needs to make a decision at the time of install, when the payout for the majority of campaigns is decided - a point at which it knows very little about the user.
To counter that, and in order to determine user legitimacy, machine learning will attempt to detect more elaborate patterns across a larger data set, including seemingly obscure characteristics.
Now, if you tried to unravel the decision making of such a specialized neural network, the result would be a collective scratching of heads. Machine learning can create some extremely complicated rulesets, combining seemingly unrelated identifiers in bizarre ways.
If questioned too often, vendors who sell anti-fraud tools that rely heavily on machine learning as the basis for rejection may, in turn, decide to hide their decisions behind a black box - that is, to never explain what they do.
This could become a real problem for fraud prevention tools in the future.
Why the black box is a bad idea
Is a black box really that bad? We can demonstrate why with an example.
Imagine a network settling a dispute with a client over rejected attributions from a recent campaign. The network lacks any data to reproduce or explain the rejections, and so has to rely on the word of the client, who in turn relies on the attribution service monitoring fraud. While that might not be an issue for a small fraction of traffic that the network won’t argue about, it becomes a big problem after a certain threshold.
Once a provider loses the ability to (or doesn’t want to) explain why an attribution was rejected, then it becomes opinion. Opinions can be argued over or disagreed with. And, if we start down this path, we’ll end up in a situation where networks could try to portray every filter as just another ignorable opinion.
We’ve explained our view on what it takes to make a good filter. Our mindset of creating a system that’s logical and transparent sidesteps the issue of opinion - each rejection is asserted in a factual, reproducible way.
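To make the contrast with the black box concrete, here is a minimal sketch of a transparent, rule-based rejection filter. The rules and thresholds are assumptions invented for illustration, not any vendor’s actual filters - the point is only that every rejection carries the exact rule that fired, so it can be reproduced and audited in a dispute.

```python
# Assumed threshold, for illustration only.
MIN_CLICK_TO_INSTALL_SECONDS = 10

def evaluate_install(install):
    """Return (accepted, reason); the reason names the rule that fired,
    so a rejection is a reproducible fact rather than an opinion."""
    if install["click_to_install_seconds"] < MIN_CLICK_TO_INSTALL_SECONDS:
        return (False, "click-to-install time below %ds threshold"
                % MIN_CLICK_TO_INSTALL_SECONDS)
    if install["clicks_from_ip_last_hour"] > 100:
        return (False, "more than 100 clicks from the same IP in one hour")
    return (True, "no filter triggered")

accepted, reason = evaluate_install({
    "click_to_install_seconds": 2,
    "clicks_from_ip_last_hour": 5,
})
print(accepted, reason)  # the rejection comes with the rule that caused it
```

A network disputing this rejection can be shown the data point and the threshold it violated - there is nothing to take on faith.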
And so, while we see machine learning as an excellent means of detection, it shouldn’t be relied upon for rejection - at least not yet. In its current state, edge cases will be missed, and the logic behind decisions may, in the end, be dismissed as opinion, leading to a lack of transparency. Instead, hard work needs to be done to build filters the right way: filters that stop fraud without also rejecting installs from legitimate sources.
Going back to our analogy - with machine learning, you now know for certain there’s pollution. But that doesn’t mean it’s time to rely on that logic alone to begin filtering water. Your best bet? With investigation and proper filtering, you can head upstream, find the sources of the pollution, and stop them all at the origin.