Real Time Fraud Detection Using Apache Flink — Part 2

“A lie can travel halfway around the world while the truth is putting on its shoes”
This line, attributed to Mark Twain in the 19th century, applies just as well to today's AI-dominated world, where fraudulent parties will always be a step ahead of those trying to stop them, causing irreparable damage. That is why the focus must be on prevention; cure is not an option.
Orchestrating a fraud prevention mechanism requires a systematic approach, and this article is for those who want to understand the foundational elements of a good ML-based Fraud Prevention System. Here, I will focus on consolidating all the pertinent components without getting deep into implementation details, for a better general understanding. Though I will use the advertising domain as an example, the principles extend to other domains as well.
Before I get into the ML system, let's first understand how the different entities in the AdTech domain interact -
The involvement of multiple players not only results in very high costs to advertisers but also poses challenges in data availability for building ML solutions around fraud prevention, recommendation engines, etc. We will talk more about data acquisition in subsequent blogs. But assuming the data is available and you are a team of developers trying to kick off this project today, what would your approach to building such a solution be?
“You build a solution to combat fraud. They will find alternate ways to do it all over again.”
Fraud Prevention Systems are always a work in progress. With advancements in technology, the ways to commit fraud have also evolved and will continue to do so; think of GenAI, for example. It is clear that if we are not a step ahead, we won't be able to prevent it. Only a deeper understanding of your target domain can make you aware of its key vulnerabilities. Therefore it is important to stay abreast of -
Once you have a holistic knowledge of your target domain, start building hypotheses around how to capture fraudulent transactions and what behavior in the data can resemble that of a fraudulent conversion postback (transaction). Don't worry about whether each hypothesis is correct or not; just make sure it is of the form -
“If a conversion postback shows <x behavior>, it has a higher propensity to be fraudulent.”
Confidentiality doesn’t allow me to share all our hypotheses publicly, but a few commonly known ones are -
The objective of this step is to determine whether a hypothesis graduates into a model feature or not.
To do that, wear your problem-solver hat and leverage exploratory analyses. Note that you may not always have labeled data, and that's not a problem. There are three main approaches -
One important thing to note here: several hypotheses will be based on logical rules and can be implemented even without ML training. Building a good ML solution takes time, especially to minimize false positives. Therefore, it is often a good decision to build a rule-based Risk Framework first while the team continues working on a better ML-based solution.
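To make this concrete, here is a minimal sketch of what such a rule-based Risk Framework could look like in Python; the rule names, postback fields, weights and threshold are hypothetical illustrations, not the ones used in production.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True if the postback looks suspicious
    weight: float                  # contribution to the overall risk score


@dataclass
class RiskFramework:
    rules: List[Rule] = field(default_factory=list)
    block_threshold: float = 1.0

    def score(self, postback: dict) -> float:
        # Sum the weights of all rules that fire for this postback.
        return sum(r.weight for r in self.rules if r.check(postback))

    def is_fraudulent(self, postback: dict) -> bool:
        return self.score(postback) >= self.block_threshold


# Hypothetical rules derived from hypotheses of the form described above.
framework = RiskFramework(rules=[
    Rule("very_short_click_to_install", lambda p: p.get("click_to_install_secs", 1e9) < 10, 0.6),
    Rule("datacenter_ip", lambda p: p.get("ip_type") == "datacenter", 0.5),
    Rule("fresh_device_id", lambda p: p.get("device_id_age_days", 365) < 1, 0.3),
])

print(framework.is_fraudulent({"click_to_install_secs": 5, "ip_type": "datacenter"}))  # True
```

A framework like this is easy to audit and extend rule by rule, which is exactly why it makes a useful stopgap while the ML solution matures.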
Fraud prevention is one of the few Machine Learning use cases where real-time data availability is extremely important. To see why, imagine a bot converting a CPI offer within 60 seconds of its first click event. If the model doesn't have information about clicks in the last 60 seconds, it is highly likely to get the prediction wrong. At times, more than 50% of fraudulent attempts are missed purely due to the unavailability of real-time data.
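As an illustration of the 60-second example, below is a minimal in-memory sketch of a real-time "clicks in the last 60 seconds" feature. In practice this would live in a streaming pipeline and a low-latency feature store; the key and field names are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class ClicksInLastWindow:
    """Counts click events per key (e.g. device_id) over a sliding time window."""

    def __init__(self, window_secs: int = 60):
        self.window_secs = window_secs
        self.events = defaultdict(deque)  # key -> timestamps of recent clicks

    def record_click(self, key: str, ts: Optional[float] = None) -> None:
        self.events[key].append(ts if ts is not None else time.time())

    def count(self, key: str, now: Optional[float] = None) -> int:
        now = now if now is not None else time.time()
        q = self.events[key]
        while q and q[0] < now - self.window_secs:  # evict clicks outside the window
            q.popleft()
        return len(q)


# At conversion time, clicks_last_60s can be fed to the model as a real-time feature.
feature = ClicksInLastWindow(window_secs=60)
feature.record_click("device-123")
clicks_last_60s = feature.count("device-123")
```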
There are two important aspects to take care of -
These decisions and their implications must be carefully evaluated in the data exploration phase itself. The cost of a wrong decision here is hundreds of hours of work to correct it, let alone the value lost by not detecting fraud correctly in the meantime.
For instance, say you train a model using a batch feature scheduled every 15 minutes, and you end up deploying this model. The model will never be able to capture the behavior of fraudulent entities from the last 15 minutes, which may affect predictions. Later, if you see value in real-time data and decide to move the feature to a streaming pipeline, you will need to go through the entire process of feature development, model training, evaluation and deployment all over again, which is weeks of work. This can easily be avoided with some exploratory analysis and a good system design document.
The first feature store we built for fraud prevention had ~800 features (batch + streaming), yet all three models that went live had fewer than 50 features each. It is therefore important to be mindful and selective about the features you use for model training.
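As a hedged illustration of that pruning step, a simple first pass can drop near-constant and highly correlated features before any model-based selection; the thresholds below are arbitrary placeholders, not the values we used.

```python
import numpy as np
import pandas as pd


def prune_features(df: pd.DataFrame,
                   var_threshold: float = 1e-4,
                   corr_threshold: float = 0.95) -> pd.DataFrame:
    # Drop near-constant features that carry little signal.
    variances = df.var(numeric_only=True)
    df = df.drop(columns=variances[variances < var_threshold].index)

    # Drop one feature from every highly correlated pair.
    corr = df.corr(numeric_only=True).abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)
```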
Development, deployment and management of features is a constant struggle for Data Science and Engineering teams. Streaming features in particular take days to weeks of effort, depending on complexity. Feature platforms like Canso, with no-code automation, can help you ship complex batch/streaming features in less than 15 minutes, saving weeks of effort.
Fraud is an ‘extremely rare event’ Machine Learning use case, with as few as 0.1% of conversions being fraudulent. Traditional classification models therefore don't work well here. There are primarily two kinds of models used in fraud prevention solutions -
Choose the model that fits your use case. For advertising, Auto Encoder based anomaly detection solutions work great. Auto Encoders support both supervised and unsupervised learning. The model encodes the input data into a low-dimensional space and then reconstructs it through the decoder layers. The error between the input and the reconstructed output, termed the reconstruction error, indicates whether a conversion is anomalous or not.
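A minimal sketch of this idea, assuming standardized numeric features and using Keras purely for illustration (the layer sizes, epochs and the 99.9th-percentile threshold are placeholder choices, not our production configuration):

```python
import numpy as np
from tensorflow import keras


def build_autoencoder(n_features: int) -> keras.Model:
    return keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(32, activation="relu"),   # encoder
        keras.layers.Dense(8, activation="relu"),    # low-dimensional bottleneck
        keras.layers.Dense(32, activation="relu"),   # decoder
        keras.layers.Dense(n_features, activation="linear"),
    ])


# X_train: mostly genuine conversions, standardized to zero mean / unit variance.
X_train = np.random.randn(10_000, 20).astype("float32")  # placeholder data

model = build_autoencoder(n_features=X_train.shape[1])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, X_train, epochs=10, batch_size=256, verbose=0)

# Reconstruction error per conversion; a high error suggests an anomalous postback.
reconstructed = model.predict(X_train, verbose=0)
errors = np.mean((X_train - reconstructed) ** 2, axis=1)
threshold = np.percentile(errors, 99.9)  # e.g. flag the top 0.1% as anomalous
is_anomalous = errors > threshold
```

Because the model is trained mostly on genuine traffic, conversions it cannot reconstruct well stand out with a high error, which is what makes the reconstruction error usable as an anomaly score.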
By default, Auto Encoders give the same weight to all features when calculating the reconstruction error during performance evaluation. As long as all features follow a similar distribution and scale, this works fine; if not, performance suffers. In such cases it becomes important to design strategies for assigning feature weights, such as optimization using a Genetic Algorithm. I will dive deeper into this in another blog.
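The weighting itself is a one-line change to the error computation; how good weights are found (for example with a genetic algorithm) is the part deferred to the follow-up post. A small sketch, with hypothetical shapes:

```python
import numpy as np


def weighted_reconstruction_error(x: np.ndarray, x_hat: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Per-sample reconstruction error with per-feature weights (assumed to sum to 1)."""
    return np.average((x - x_hat) ** 2, axis=1, weights=weights)


# Uniform weights reproduce the default behaviour; a genetic algorithm (or any other
# optimizer) can instead search for weights that maximize a validation metric such as
# precision at a fixed review rate.
n_features = 20
uniform_weights = np.full(n_features, 1.0 / n_features)
```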
Once the model is ready to be deployed, the next step is to build an Inference Service that processes requests (conversion postbacks in AdTech) in real time and makes predictions to determine whether each one is genuine or fraudulent.
This service is responsible for the following tasks -
This entire processing needs to happen in milliseconds to make sure requests are either converted or blocked with minimal latency.
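Purely as an illustration of the shape of such a service, here is a skeletal scoring endpoint sketched with FastAPI; the route, payload fields, threshold and placeholder helpers are assumptions, not our production design.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
RECONSTRUCTION_THRESHOLD = 0.8  # placeholder; tuned offline on validation data


class ConversionPostback(BaseModel):
    conversion_id: str
    device_id: str
    offer_id: str


def fetch_features(device_id: str, offer_id: str) -> list:
    # Placeholder: in practice this reads real-time + batch features from a feature store.
    return [0.0] * 20


def reconstruction_error(features: list) -> float:
    # Placeholder: in practice this runs the trained autoencoder and computes the error.
    return 0.1


@app.post("/score")
def score_postback(postback: ConversionPostback) -> dict:
    features = fetch_features(postback.device_id, postback.offer_id)
    error = reconstruction_error(features)
    decision = "block" if error > RECONSTRUCTION_THRESHOLD else "convert"
    # The decision and score would also be logged and emitted as metrics for monitoring.
    return {"conversion_id": postback.conversion_id, "decision": decision, "score": error}
```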
Some tech stack choices that work quite well here are -
For any application running in production and serving live users, a monitoring and alerting setup is crucial.
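As a small, hedged example, the scoring service can expose basic decision and latency metrics with the Prometheus Python client; the metric names and port below are made up for illustration.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; Prometheus scrapes these, and dashboards/alerts act on them.
DECISIONS = Counter("fraud_decisions_total", "Scoring decisions by outcome", ["decision"])
LATENCY = Histogram("fraud_scoring_latency_seconds", "End-to-end scoring latency")

start_http_server(9100)  # expose /metrics for Prometheus to scrape


@LATENCY.time()
def score_and_record(postback: dict) -> str:
    decision = "convert"  # placeholder for the real scoring call
    DECISIONS.labels(decision=decision).inc()
    return decision
```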
The following tools can be used to set up monitoring and alerting -
Many times during deployments, when things go haywire, these frameworks prove valuable in identifying issues within minutes. A CI/CD (GitOps-based deployment) setup and a proper testing framework also create immense value in avoiding such downtime. This should be the standard practice -
A few challenges you may encounter while building this solution -
In this article we have touched on all the important components of building a Machine Learning based Fraud Prevention Solution. This should give you a good starting point to plan your initiatives and identify areas that need further research. Technical implementation is a more focused discussion, for which our team will publish separate blogs.
If you are interested in brainstorming on this topic, exploring solutions or speaking about anything related to fraud and AI, we’ll be happy to connect. Please visit us here to schedule a call.