What’s going into Distribution Modeling 2.0
Paul H. Müller
Co-Founder & CTO
Sep 5, 2018
Spoofed attributions are as old as attribution itself.
As soon as people discovered that there was a chance to randomly steal credit for users who installed an app organically, bad actors started to send large volumes of fake clicks with the aim of stealing both attribution and ad budgets.
The reason that this ‘chance’ exists is inherent to the nature of attribution for mobile apps. In order to do attribution correctly, we compare the advertising ID or the fingerprint of a device at the time of an ad click with the one recorded when the app activity (such as an install) occurs.
Discovering which advertising IDs are active in which market is surprisingly easy, as there are many means of obtaining them. Generating a match between a fictitious click and a real install isn’t an impossible task either, even if it is based on random chance.
While conversion rates may be low, the sheer number of attributions makes click spamming a viable (and popular) method of committing fraud.
It’s also one of the most diverse methods available, with spoofed clicks originating from stacked ads, preloading, views-as-clicks or server-to-server click spoofing. Many of them are available as high- or low-frequency spam, which is a distinction we’ll discuss later in the article.
The good news is that Adjust has developed a new approach to filtering all kinds of click spamming, which is faster and much more reliable than ever before. Before we talk about that, let’s have a look at how things are currently done.
Detection and filtering today
So, you’ve set up a campaign, but you want to check for fraudulent activity derived from click spamming. The most widespread approach is to look at the average conversion rate of sub-publishers in a campaign.
Rates of 0.05% and lower are almost certainly the result of random attribution of real users. However, with some knowledge of the campaign targeting, and by relying on fingerprinting instead of device ID matching, attackers can push this random chance as high as 0.5%. That is a far more plausible rate, one that also appears in certain types of legitimate campaigns.
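To make the conversion-rate check concrete, here is a minimal sketch in Python. The 0.05% cut-off comes from the figures above; the `SubPublisherStats` type, the function names, and the sample numbers are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass

# Cut-off taken from the 0.05% figure in the text; treating it as a hard
# threshold is an assumption made for this sketch.
SPAM_CONVERSION_RATE = 0.0005


@dataclass
class SubPublisherStats:
    name: str
    clicks: int
    installs: int


def conversion_rate(stats: SubPublisherStats) -> float:
    """Attributed installs per click; 0.0 when there are no clicks."""
    return stats.installs / stats.clicks if stats.clicks else 0.0


def looks_like_random_attribution(stats: SubPublisherStats) -> bool:
    """Flag sub-publishers whose conversion rate is at or below the cut-off."""
    return stats.clicks > 0 and conversion_rate(stats) <= SPAM_CONVERSION_RATE


# Invented sample data.
stats = [
    SubPublisherStats("pub_a", clicks=1_000_000, installs=300),    # 0.03%
    SubPublisherStats("pub_b", clicks=20_000, installs=400),       # 2%
    SubPublisherStats("pub_c", clicks=1_000_000, installs=5_000),  # 0.5%
]
flagged = [s.name for s in stats if looks_like_random_attribution(s)]
```

In this invented data set only `pub_a` is flagged. A spammer who manages to reach 0.5%, like `pub_c`, passes the check, which is the weakness described above.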
Another common approach to determining the correlation between clicks and installs is by using Click-To-Install-Time (CTIT) diagrams.
A normal campaign produces a chart with a characteristic peak within the first hour after the click, followed by a steady decline. A graph for a traffic source that employs click spamming would show a much flatter and wider distribution, stretching out to the end of the attribution window.
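One simple way to summarise such a chart in a single number is the share of installs that happen within the first hour after the click. The sketch below assumes that summary; the sample CTIT values and the implied seven-day attribution window are invented for illustration.

```python
from datetime import timedelta

FIRST_HOUR = timedelta(hours=1)


def first_hour_share(ctits):
    """Fraction of click-to-install times that fall within the first hour."""
    if not ctits:
        return 0.0
    return sum(1 for ctit in ctits if ctit <= FIRST_HOUR) / len(ctits)


# Invented CTITs for a healthy source (minutes) and a spamming source (hours).
normal_source = [timedelta(minutes=m)
                 for m in (2, 5, 8, 15, 30, 45, 90, 300, 1500, 4000)]
spam_source = [timedelta(hours=h)
               for h in (0.5, 6, 20, 40, 70, 100, 130, 150, 160, 165)]

normal_share = first_hour_share(normal_source)  # peaked distribution
spam_share = first_hour_share(spam_source)      # flat, wide distribution
```

With these made-up values the healthy source has six of ten installs in the first hour, while the spamming source has one of ten.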
Now, since the pattern is obvious to spot, it might seem like a strong basis for detection and filtering. However, this approach has a fatal flaw.
If you’re only looking at the last click that won an attribution and taking its CTIT into consideration, sampling bias is introduced. By only looking at the CTIT of the winning click spam, we’re exposed to some pretty serious blind spots.
The best example of this vulnerability is with “high-frequency” click spam. Imagine a fraudster who sends the same click for the same device every hour, over and over. When the device finally installs, the final click sent before the conversion will have a very short CTIT, as the fake clicks were sent only an hour apart.
Without pre-filtering of repeated clicks, CTIT analysis will not be able to efficiently spot this kind of fraud.
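The effect is easy to reproduce. The following sketch assumes plain last-click attribution and a fraudster who repeats a click every hour for three days; the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def winning_ctit(click_times, install_time):
    """CTIT of the last click at or before the install (last-click attribution)."""
    prior = [c for c in click_times if c <= install_time]
    return install_time - max(prior) if prior else None


# The same click, repeated every hour for 72 hours.
start = datetime(2018, 9, 1)
spam_clicks = [start + timedelta(hours=h) for h in range(72)]

# Three organic installs at arbitrary moments.
installs = [
    start + timedelta(hours=5, minutes=12),
    start + timedelta(hours=30, minutes=47),
    start + timedelta(hours=71, minutes=59),
]
observed = [winning_ctit(spam_clicks, t) for t in installs]
```

Whenever the install happens, the winning click is less than an hour old, so a CTIT chart built only from winning clicks looks like a healthy campaign.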
The other issue is that we only see spoofed clicks that were lucky enough either to match an organic user with no legitimate clicks before the install, or to beat a real click from another network to the attribution. A classic case of survivorship bias.
By only looking at the CTIT of attributed installs, we miss out on the vast majority of spammed clicks that never match.
The result is that even pure click spamming does not produce a perfectly even CTIT distribution, but still shows a slight “bump” of attributed users within the first hour.
Many attribution companies only offer CTIT diagrams to clients without any further filtering. And while Adjust offers a combined filter that removes repeated clicks and then checks CTIT distribution at the sub-publisher level, there is always room to optimize. So, how do we improve our filters?
The next generation - Scoring all clicks
As discussed in our series of articles on fraud theory (start with part one here if you haven’t taken a look), a good filter should produce the lowest possible number of false positives and false negatives.
This means that a filter shouldn’t miss fraud, but at the same time shouldn’t wrongly classify real traffic as fraudulent either. It also should be simple to explain and rely on logical facts that the attacker cannot control.
Now, in order to determine if a certain traffic source is showing enough of a correlation between clicks and attributed installs, we track the share of attributions within the first hour vs. those after.
Every time a new attribution occurs, we check the distribution of the traffic source behind its closest click. If that source’s share of first-hour attributions is above a certain percentage, we allow attribution to this click. If it’s below, we deny it.
In both cases, we add the CTIT of that click to the distribution of its traffic source, a process called “scoring”. This process essentially scores sub-publishers of a network to determine their correlation between click and install.
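A rough sketch of this scoring loop could look like the following. It is not Adjust's implementation: the 50% threshold, the five-sample warm-up, and all names are assumptions chosen only to make the example runnable.

```python
from dataclasses import dataclass
from datetime import timedelta

FIRST_HOUR = timedelta(hours=1)
MIN_FIRST_HOUR_SHARE = 0.5  # hypothetical threshold
MIN_SAMPLES = 5             # hypothetical warm-up before the filter acts


@dataclass
class SourceDistribution:
    """First-hour vs. later CTIT counts for one sub-publisher."""
    first_hour: int = 0
    later: int = 0

    @property
    def total(self) -> int:
        return self.first_hour + self.later

    def share(self) -> float:
        return self.first_hour / self.total if self.total else 0.0

    def add(self, ctit: timedelta) -> None:
        if ctit <= FIRST_HOUR:
            self.first_hour += 1
        else:
            self.later += 1


def attribute(dist: SourceDistribution, winning_ctit: timedelta) -> bool:
    """Decide on the distribution so far, then score the click either way."""
    allowed = dist.total < MIN_SAMPLES or dist.share() >= MIN_FIRST_HOUR_SHARE
    dist.add(winning_ctit)
    return allowed


# Invented CTITs of winning clicks from a spamming sub-publisher.
spam_ctits = [timedelta(minutes=30)] + [
    timedelta(hours=h) for h in (10, 20, 50, 90, 100, 120)
]
dist = SourceDistribution()
decisions = [attribute(dist, ctit) for ctit in spam_ctits]
```

In this run the first five attributions pass during the warm-up, and the sixth and seventh are denied once the first-hour share has dropped to one in five.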
Training a distribution for a sub-publisher requires at least a few won attributions to be certain, while high-frequency clicks get deleted without influencing the scoring.
The selection of “winning” spam also forces us to look for “less exponential” distributions instead of “perfectly flat” ones. This leaves the door open for arguments over our threshold selection.
The answer to these problems is surprisingly simple, and one we’ll soon be adopting in our technology.
Instead of only scoring clicks that win, we will score all clicks within the list of candidates at attribution time. This way, we score around 10x more data points while training our filters much faster. It also automatically punishes high-frequency click spam, as each repeated click that wasn’t within the last hour of the install “poisons” the distribution score.
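The difference between the two scoring modes shows up even for a single install. The sketch below assumes one device that received the same spoofed click every hour for a day; the timestamps and the helper name are hypothetical.

```python
from datetime import datetime, timedelta

FIRST_HOUR = timedelta(hours=1)


def first_hour_share(ctits):
    """Fraction of click-to-install times that fall within the first hour."""
    return sum(1 for ctit in ctits if ctit <= FIRST_HOUR) / len(ctits)


install = datetime(2018, 9, 5, 12, 20)
last_click = datetime(2018, 9, 5, 12, 0)

# 24 candidate clicks, one per hour, ending 20 minutes before the install.
candidates = [last_click - timedelta(hours=h) for h in range(24)]
ctits = [install - click for click in candidates]

# Scoring only the winner: one data point, and it looks perfectly healthy.
winner_only_share = first_hour_share([min(ctits)])

# Scoring every candidate: 24 data points, 23 of them outside the first hour.
all_candidates_share = first_hour_share(ctits)
```

Scoring only the winner yields a first-hour share of 1.0, while scoring all 24 candidates yields 1/24, so the repeated clicks themselves drag the source's distribution down.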
With this simple change, we can adjust the thresholds for what is considered a fraudulent distribution, reducing false positives. At the same time, we can learn from a much higher number of attributions, reducing false negatives.
Removing the need for pre-filtering repeating clicks makes the filter simpler to explain. Overall, this is a significant improvement.
What these advances mean
Let’s look at an example of how this works. Say that fraudster A is continuously sending clicks for a large number of advertising IDs directly from his server. He knows the most likely period of the day that users will be installing, so he only repeats his clicks during these hours.
Distribution Modeling 1.0
Whenever we try to attribute an install, we look at all clicks within the attribution window. If the number of repeated clicks is above a certain threshold, we remove all of them from the list and see which other “candidates” are left.
As Fraudster A is smart, more often than not, he stays under that threshold and gains attribution for fabricated clicks. Even if we do catch his spoofed clicks, we only remove them for the install at hand, with no learning effect for the next one.
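As an illustration of this pre-filter and its weakness, consider the sketch below. The threshold of three repeats, the source names, and the representation of a candidate as a `(source, hours before install)` pair are all assumptions made for the example.

```python
from collections import Counter

MAX_REPEATS = 3  # hypothetical threshold for repeated clicks per install


def remove_repeated_clicks(candidates):
    """Drop every click from a source that repeats more than MAX_REPEATS times."""
    counts = Counter(source for source, _ in candidates)
    return [(s, hours) for s, hours in candidates if counts[s] <= MAX_REPEATS]


def last_click_winner(candidates):
    """Source of the click closest to the install among the remaining candidates."""
    remaining = remove_repeated_clicks(candidates)
    return min(remaining, key=lambda c: c[1])[0] if remaining else None


# Candidates for one install: (source, hours before the install).
noisy = [("fraudster_a", h) for h in (1, 2, 3, 4, 5)] + [("network_b", 6)]
careful = [("fraudster_a", h) for h in (1, 2, 3)] + [("network_b", 6)]

noisy_winner = last_click_winner(noisy)      # fraudster exceeds the threshold
careful_winner = last_click_winner(careful)  # fraudster stays under it
```

With five repeats the fraudster's clicks are removed and `network_b` wins. With three repeats he stays under the threshold and takes the attribution, and nothing is remembered for the next install.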
Distribution Modeling 2.0
Instead of applying an arbitrary threshold, we look at the CTIT that each click in our list of candidates would have had, had it won. By adding each one of them to the distribution (the share of first hour vs. later attributions), the fraudster gains a negative score as soon as we find more clicks outside the first hour than inside.
After a few installs, we start rejecting attribution even when a click has been repeated as few as two times.
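The sketch below walks Fraudster A through four installs under this scheme. It assumes a net score of first-hour clicks minus later clicks, a warm-up of six scored clicks before the filter acts, and a decision made on the history before the current install is scored; these details and all names are hypothetical.

```python
from datetime import timedelta

FIRST_HOUR = timedelta(hours=1)
MIN_SCORED_CLICKS = 6  # hypothetical warm-up before the filter acts


class SourceScore:
    """Counts candidate clicks inside and outside the first hour for one source."""

    def __init__(self):
        self.inside = 0
        self.outside = 0

    def add(self, ctit):
        if ctit <= FIRST_HOUR:
            self.inside += 1
        else:
            self.outside += 1

    @property
    def net(self):
        return self.inside - self.outside

    def rejects(self):
        scored = self.inside + self.outside
        return scored >= MIN_SCORED_CLICKS and self.net < 0


def process_install(score, candidate_ctits):
    """Decide using the history so far, then score every candidate click."""
    allowed = not score.rejects()
    for ctit in candidate_ctits:
        score.add(ctit)
    return allowed


# Fraudster A sends only three clicks per device, so a repeat threshold
# like the one in Distribution Modeling 1.0 would not necessarily trigger.
pattern = [timedelta(minutes=30), timedelta(hours=5), timedelta(hours=9)]
score = SourceScore()
decisions = [process_install(score, pattern) for _ in range(4)]
```

Each install adds one click inside the first hour and two outside, so the net score turns negative immediately. In this run the first two installs are still attributed during the warm-up and the third and fourth are rejected.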
Acting on mobile ad fraud
As you can see, following our system of fraud prevention - going from detection to research to filtering - allows us to reject spoofed attributions that other “detection only” systems wouldn’t even pick up on.
We are committed to rooting out all types of fraud from our ecosystem, so stay tuned for the next wave of new filters, coming soon.