
How to Build a Real-time Fraud Prevention System
Kumar Sanjog
CEO & Co-Founder
10 mins read
“A lie can travel halfway around the world while the truth is putting on its shoes”

This quip, often attributed to Mark Twain, applies just as well to today's AI-dominated world: fraudulent parties will always be a step ahead of those trying to stop them, causing irreparable damage. Prevention is therefore crucial, because cure is not an option.

Orchestrating a fraud prevention mechanism requires a systematic approach, and this article is for those who want to understand the foundational elements of a good ML-based Fraud Prevention System. I will focus on consolidating all the pertinent components without going deep into implementation details, for a better general understanding. Though I will use the advertising domain as an example, the principles extend to other domains as well.

Before getting into the ML system, let's first understand how the different entities in the AdTech domain interact -

  1. Advertiser is the brand that launches a campaign
  2. Publisher is the platform (website/app) where end users interact with the campaign
  3. Middle players -
    • DSP (Demand Side Platform) — platform for advertisers to buy ad impressions
    • SSP (Supply Side Platform) — platform for publishers to sell ad inventory
    • Ad network — medium for advertisers to connect with publishers
  4. MMPs (Mobile Measurement Partners) help with ad tracking

Involvement of multiple players not only results in very high costs to advertisers but also poses challenges in data availability for building ML solutions around fraud prevention, recommendation engines, etc. We will talk more about data acquisition in subsequent blogs. But assuming it's available and you are a team of developers trying to kick off this project today, what would be your approach to building such a solution?

Gather Domain Knowledge

You build a solution to combat fraud. Fraudsters will find alternate ways to do it all over again.

Fraud Prevention Systems are always a work in progress. With advancements in technology, the ways to commit fraud have also evolved and will continue to do so; think of GenAI, for example. If we are not a step ahead, we won't be able to prevent it. Only a deeper understanding of your target domain can make you aware of its key vulnerabilities. Therefore, it is important to stay abreast of -

  1. The target industry — knowledge of the product, the sector you operate in, and the accompanying loopholes that can be exploited.

    Here are a few examples -
    a. If you are in the space of rewarded ads, it's important to know how users earn rewards and how that behavior differs when it is bots rather than real users.
    b. If you are in real money gaming, it's critical to understand whether there are loopholes in your app where users can collude and take advantage of new-user rewards.

  2. Fraud perpetration strategies

    Thinking ahead, even before writing the first line of your code, it’s important to understand the methods through which fraud can be committed. Think like a fraudster.
    a. Some examples that are widely known are Click Farms, Bot Traffic, Install Hijacking, VPNs etc.
    b. There can be many others specific to your product and sector. For example, in rewarded ads, if the user has to interact with multiple parties (publisher, ad network, advertiser) to complete an offer, check whether the behavior is consistent across all of them. In real money gaming, the loopholes can be as simple as game lobbies with no players (often high value), where a colluding user is almost certain to match with an opponent of choice, making win/loss irrelevant and the rewards easy to exploit.
Fig 1. Common types of fraud and behavioral features

Build & Validate Hypotheses

Hypotheses

Once you have holistic knowledge of your target domain, start building hypotheses about how to capture fraudulent transactions and what behavior in the data resembles that of a fraudulent conversion postback (transaction). Don't worry about whether each hypothesis is correct; just make sure your hypotheses are of the form -

“If a conversion postback shows <x behavior>, it has a higher propensity to be fraudulent.”

Confidentiality doesn't allow me to share all hypotheses publicly, but a few commonly known ones are -

  1. If the time between click and conversion postback is unusually low, it may indicate bot behavior
  2. If the user is connected to a VPN, the transaction is highly likely to be fraudulent, although there can be edge cases
  3. If the user/device is hopping across geos within a short time interval, it's likely to be fraudulent

Validation

The objective of this step is to determine whether a hypothesis graduates to becoming a model feature.

To do that, wear your problem-solver hat and leverage exploratory analyses. Know that you may not always have labeled data, and that's not a problem. There are three main approaches -

  1. Get labeled data from partners or anti-fraud solutions — this may not always be reliable, but it can definitely serve as a starting point for research.
  2. A few hypotheses around loophole exploitation are so strong that they don't need validation. As long as you see the behavior in the data, it should be safe to block, e.g. high-risk IPs, fraud history with strong confidence, geo hopping.
  3. If the hypothesis doesn't fall into either of the above categories, the exploratory analyses should tie the anomalous behavior in a feature to a few KPIs that matter to the business, such as the percentage of these anomalous conversions that go on to make an in-app purchase, visit multiple times, or engage with the app. If that percentage is low, the hypothesis holds merit (a minimal sketch follows this list).
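
For illustration, here is a minimal sketch of the third approach in Python. The conversions table, column names, and the 10-second CTIT threshold are hypothetical placeholders for your own data and hypothesis:

```python
import pandas as pd

# Hypothetical conversions table; column names are illustrative only.
df = pd.DataFrame({
    "ctit_secs": [4, 900, 3, 1200, 2, 640],   # click-to-install time
    "made_iap":  [0, 1,   0, 1,    0, 1],     # did the user purchase in-app?
})

# Hypothesis: unusually low click-to-install time indicates bot behavior.
df["anomalous"] = df["ctit_secs"] < 10

# Compare a downstream KPI between anomalous and normal conversions.
kpi = df.groupby("anomalous")["made_iap"].mean()
print(kpi)
# If the in-app-purchase rate for anomalous conversions is far lower than for
# normal ones, the hypothesis holds merit and graduates to a model feature.
```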

One important thing to note: several hypotheses are based on logical rules and can be implemented even without ML training. Building a good ML solution takes time, especially to minimize false positives. Therefore, it is sometimes a good decision to build a rule-based Risk Framework first (a minimal sketch follows) while the team continues working on a better ML-based solution.
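
Here is a minimal sketch of what such a rule-based layer could look like in Python. The `Postback` fields, rule names, and thresholds are hypothetical and would come from your own validated hypotheses:

```python
from dataclasses import dataclass

# Hypothetical postback payload; field names are illustrative only.
@dataclass
class Postback:
    click_to_conversion_secs: float
    is_vpn: bool
    ip_risk_score: float        # e.g. from a third-party IP reputation feed
    geo_hops_last_hour: int     # distinct geos seen for this device recently

def evaluate_rules(pb: Postback) -> list[str]:
    """Return the list of rules the postback violates (empty = allow)."""
    violations = []
    if pb.click_to_conversion_secs < 10:   # unusually fast click-to-conversion
        violations.append("low_ctit")
    if pb.is_vpn:                          # VPN traffic, minus known edge cases
        violations.append("vpn")
    if pb.ip_risk_score > 0.9:             # high-risk IP
        violations.append("high_risk_ip")
    if pb.geo_hops_last_hour > 3:          # geo hopping within a short interval
        violations.append("geo_hopping")
    return violations

# Usage: block the conversion if any strong rule fires.
pb = Postback(click_to_conversion_secs=4, is_vpn=True,
              ip_risk_score=0.2, geo_hops_last_hour=0)
print(evaluate_rules(pb))   # ['low_ctit', 'vpn']
```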

Engineer Features & Design Feature Store

Fraud prevention is one of the few Machine Learning use cases where real-time data availability is extremely important. For better understanding, imagine a bot converting a CPI offer within 60 seconds of its first click event. If the model doesn't have information about clicks in the last 60 seconds, it is highly likely to make an incorrect prediction. At times, more than 50% of fraudulent attempts go undetected purely because real-time data is unavailable.

There are two important aspects to take care of -

  1. Batch vs Streaming — build streaming pipelines for features where real-time information is critical. For the rest, batch pipelines can save you cost.
  2. Low-latency Retrieval — use key-value stores like Redis or DynamoDB, through which features can be served for real-time inference within milliseconds. Note that key-value storage is relatively costly, so it's important to know how much history is needed to make correct predictions. Fraudulent transactions have limited history, often minutes or at most hours; increasing history beyond a few days brings marginal to no improvement in performance (a minimal sketch follows this list).
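
As a rough illustration, here is a minimal sketch of low-latency feature reads and writes using the redis-py client. The key layout, feature names, and the 48-hour TTL are assumptions made for this example, not a standard:

```python
import json
import redis

# Assumed key layout: one hash per device, e.g. "features:device:<device_id>".
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_streaming_features(device_id: str, features: dict) -> None:
    key = f"features:device:{device_id}"
    r.hset(key, mapping={k: json.dumps(v) for k, v in features.items()})
    r.expire(key, 48 * 3600)   # fraud signals decay fast; keep only recent history

def read_features(device_id: str) -> dict:
    key = f"features:device:{device_id}"
    return {k: json.loads(v) for k, v in r.hgetall(key).items()}

# Example: a streaming job updates counters; inference reads them back in ~1 ms.
write_streaming_features("abc-123", {"clicks_last_60s": 17, "distinct_ips_last_1h": 5})
print(read_features("abc-123"))
```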

These decisions and their implications must be carefully evaluated in the data exploration phase itself. The cost of making a wrong decision here is hundreds of hours of work correcting it, let alone the value lost by not detecting fraud correctly in the meantime.

For instance, say you train a model using a batch feature that is refreshed every 15 minutes, and you deploy that model. It will never be able to capture the behavior of fraudulent entities from the last 15 minutes, which may affect predictions. Later, if you see value in real-time data and decide to move that feature to a streaming pipeline, you will need to go through the entire process of feature development, model training, evaluation, and deployment all over again, which is weeks of work. This can easily be avoided with some exploratory analysis and a good system design document.

The first feature store we built for fraud prevention had ~800 features (batch + streaming), but all three models that went live had fewer than 50 features each. Therefore, it's important to be mindful and selective about the features you use for model training.

Development, deployment, and management of features is a constant struggle for Data Science and Engineering teams. Streaming features in particular can take days to weeks of effort depending on complexity. Feature Platforms like Canso, with no-code automation, can help you ship complex batch/streaming features in less than 15 minutes, saving weeks of effort.

Select the Right Model for Training

Model Selection

Fraud is an ‘extreme rare event’ Machine Learning use case, with as few as 0.1% of conversions being fraudulent. Therefore, traditional classification models don’t work well here. There are primarily two model families used in fraud prevention solutions -

  1. GNN (Graph Neural Network) — good for use cases where entities are connected through relationships; GNNs are great at capturing complex patterns in graph-structured data. Suitable for detecting fraud in financial transactions, real-money gaming, and social media.
  2. Auto Encoders — an anomaly detection approach that is quite powerful at capturing complex patterns across multiple features. It works on the principle that fraudulent transactions are anomalous and will appear as outliers when you study the entire population. Performs well in detecting fraud in advertising, credit card transactions, and claims.

Anomaly Detection

Choose the model that fits your use case. For advertising, Auto Encoder based Anomaly Detection solutions work great. Auto Encoders can be used in both supervised and unsupervised settings. The model encodes the input data into a low-dimensional space and then reconstructs it through the decoder layers. The error between the input and the reconstructed output, termed the Reconstruction Error, indicates whether a conversion is anomalous or not.
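
To make the reconstruction-error idea concrete, here is a minimal sketch using Keras. The layer sizes, training data, and the 99th-percentile threshold are placeholders; a production model would use your selected fraud features and a threshold tuned to an acceptable false-positive rate:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features = 20  # placeholder; in practice, your selected fraud features

# Encoder compresses the input into a low-dimensional space; decoder reconstructs it.
autoencoder = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(8, activation="relu"),        # low-dimensional encoding
    layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# Train on (mostly genuine) traffic; fraudulent conversions should reconstruct poorly.
X_train = np.random.rand(10_000, n_features)   # placeholder data
autoencoder.fit(X_train, X_train, epochs=5, batch_size=256, verbose=0)

# Reconstruction error per conversion = mean squared error across features.
X_new = np.random.rand(100, n_features)
reconstructed = autoencoder.predict(X_new, verbose=0)
errors = np.mean((X_new - reconstructed) ** 2, axis=1)

threshold = np.percentile(errors, 99)          # placeholder; tune on validation data
is_anomalous = errors > threshold
```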

By default, Auto Encoders give the same weight to all features when calculating the reconstruction error during evaluation. As long as all features follow a similar distribution and scale, this works fine; if not, performance suffers. In such cases it becomes important to design new strategies for assigning feature weights, such as optimization using a Genetic Algorithm. I will dive deeper into this in another blog.

Inference Layer

Once you have the model ready to be deployed, the next step is to build an Inference Service that processes requests (conversion postbacks in AdTech) in real time and makes predictions to determine whether each one is genuine or fraudulent.

This service is responsible for the following tasks -

  1. Receive conversion requests from the client
  2. Fetch relevant features through the feature service and pre-process them before serving to model endpoints
  3. Call model endpoints for inference
  4. Call the explainability framework to come up with the right justification for blocking
  5. Respond with predictions and a message

This entire processing needs to happen in milliseconds to make sure requests are either converted or blocked with minimal latency.

Certain tech stack choices that work quite well here are -

  1. You can use Flask or FastAPI to build this application; both work well. Consider Go/Java if the scale is millions of requests per second (a minimal sketch follows below).
  2. Put load balancing and auto-scaling in place so the service can handle fluctuating or increasing load
  3. Implement multi-threading for faster processing and optimal memory utilization
Fig 2. Inference Layer Design
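
Below is a minimal FastAPI sketch of the request flow described above. `fetch_features`, `predict`, and `explain` are stubs standing in for your real feature service, model endpoint, and explainability framework; all names and the threshold are illustrative:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
THRESHOLD = 0.8  # placeholder decision threshold

class Conversion(BaseModel):
    postback_id: str
    device_id: str

class Decision(BaseModel):
    postback_id: str
    is_fraud: bool
    reason: str

# Stubs standing in for the feature service, model endpoint, and explainability
# framework; all names here are illustrative.
def fetch_features(device_id: str) -> dict:
    return {"clicks_last_60s": 17, "is_vpn": 1}

def predict(features: dict) -> float:
    return 0.93  # pretend model score

def explain(features: dict, score: float) -> str:
    return "unusually high click rate in the last 60 seconds"

@app.post("/score", response_model=Decision)
async def score_conversion(conversion: Conversion) -> Decision:
    features = fetch_features(conversion.device_id)    # 1-2. fetch + pre-process features
    score = predict(features)                           # 3. call model endpoint
    is_fraud = score > THRESHOLD
    reason = explain(features, score) if is_fraud else "genuine"  # 4. explainability
    return Decision(postback_id=conversion.postback_id,
                    is_fraud=is_fraud, reason=reason)   # 5. respond with prediction
```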

Setup Monitoring & Alerting

For any application running in production and serving live users, a monitoring and alerting setup is crucial.

  1. Monitoring of health KPIs such as CPU utilization and latency helps keep a check on stability and take appropriate measures when issues arise.
  2. Alerting framework to raise alerts if the Inference or Feature Service is not working as expected. Alerts can be of different types (a minimal sketch follows this list) -
    a. Alerts in case of failure, e.g. bugs in the inference service
    b. Threshold-based alerts, e.g. high block rate — more than 10% of requests were blocked in the last 15 mins
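
As an example of a threshold-based alert, here is a minimal sketch that posts to a Slack webhook when the block rate over the last 15 minutes exceeds 10%. The webhook URL, threshold, and counters are placeholders; in practice the same rule is often configured directly in the monitoring tool:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
BLOCK_RATE_THRESHOLD = 0.10   # alert if >10% of requests were blocked

def check_block_rate(blocked_last_15m: int, total_last_15m: int) -> None:
    """Raise a Slack alert if the block rate over the last 15 minutes is too high."""
    if total_last_15m == 0:
        return
    block_rate = blocked_last_15m / total_last_15m
    if block_rate > BLOCK_RATE_THRESHOLD:
        requests.post(SLACK_WEBHOOK_URL, json={
            "text": f"High block rate: {block_rate:.1%} "
                    f"({blocked_last_15m}/{total_last_15m}) in the last 15 min"
        })

# Typically run on a schedule (e.g. a cron job or the monitoring tool itself).
check_block_rate(blocked_last_15m=180, total_last_15m=1200)
```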

The following tools can be used to set up monitoring and alerting -

  1. Monitoring — New Relic, SignalFx
  2. Alerting — PagerDuty, SignalFx. You can integrate these with Slack for ease of use.
Fig 3. Monitoring & Alerting Framework

Many times during deployments, when things go haywire, these frameworks prove valuable in identifying issues within minutes. A CI/CD (GitOps-based deployment) setup and a proper testing framework also create immense value in avoiding such downtimes. The standard practice should be -

  1. Commit your code
  2. Get a dev-endpoint auto-deployed that you can test against
  3. Merge if it passes quality checks
  4. Scale up the experiment gradually to 100%

Challenges you may encounter

A few challenges you may encounter while building this solution -

  1. Skills — building streaming pipelines and feature stores is a niche skill; it often takes 2–3 weeks to ship one streaming feature. Prefer open-source solutions or a Feature/ML platform to expedite development.
  2. False Positives — ad fraud being an extreme-rare-event scenario often results in low recall and high false positives. You must figure out ways to minimize the damage from false positives, although there is always a trade-off.
  3. Data Skewness — most of the features you end up using will be highly skewed and will require transformation and scaling (a minimal sketch follows this list).
  4. Dev-Prod Parity — follow best practices to ensure there is no gap between the data used for training and the data used in production for inference. Systems like a feature store help here.
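
As an illustration of the skewness point above, here is a minimal sketch of a common transform-and-scale step, assuming scikit-learn is available. The log-normal "clicks" feature is synthetic placeholder data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Example of a heavily right-skewed feature (e.g. clicks per device).
clicks = np.random.lognormal(mean=2.0, sigma=1.5, size=(10_000, 1))

# Log-transform compresses the long tail, then scale to zero mean / unit variance.
log_clicks = np.log1p(clicks)
scaled = StandardScaler().fit_transform(log_clicks)

print(clicks.std(), scaled.std())   # raw feature varies wildly; scaled one is ~1.0
```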

Conclusion

In this article we have touched on all the important components of building a Machine Learning based Fraud Prevention Solution. This should give you a good starting point to plan your initiatives and identify areas that need further research. Technical implementation is a more focused discussion, for which our team will publish separate blogs.

If you are interested in brainstorming on this topic, exploring solutions, or speaking about anything related to fraud and AI, we’d be happy to connect. Please visit us here to schedule a call.

