Plaid’s API helps developers provide financial services to tens of millions of consumers in North America. These services help consumers manage their personal finances, let them transfer money and make payments, and allow them to obtain loans and mortgages. Our mission is to improve people’s lives by providing access to the financial system.
We accomplish this mission not only by helping consumers access their financial data, but also by improving data quality. Enriching data with machine learning is one of the goals of our data science and infrastructure teams, and in this post, we will also discuss the ML models our teams have built.
The Pending-to-Posted Transaction Conundrum
One way Plaid adds value to traditional banks’ transaction data is by determining when a pending transaction has posted. While the bank is processing a consumer’s transaction, the transaction is pending. During this period, the transaction amount is deducted from the account owner’s available funds, but not from the account balance. Once processing completes, the transaction changes from “pending” to “posted”, and the posted amount is deducted from the account balance.
When Plaid takes an account snapshot, we receive a list of transactions, each with a description, a monetary amount, and a status of pending or posted. While we know whether a transaction is pending, banks typically do not tell us which pending transactions in previous snapshots correspond to newly posted transactions in the current snapshot. This matching is critical to our customers: if they notify consumers of every new transaction, it is important that consumers do not receive repeated notifications for the same transaction.
Unfortunately, it is often unclear which posted transaction from the bank corresponds to a consumer’s earlier pending transaction with a merchant. A common, difficult matching problem is restaurant bills. When a consumer pays a restaurant bill by credit card, the restaurant initiates a pending transaction that does not include service charges and tips. Once the restaurant batches its receipts (usually at the end of the business day), it finalizes each transaction at its full amount, which is when the transaction becomes posted.
In other cases, the corresponding pending and posted transactions may look different. Hotels often place a higher pending charge as a hold for incidental charges. Once the transaction posts, it settles to the amount actually billed. Merchants, payment processors, and financial institutions can also change the description of a transaction between its pending and posted states.
Our high-level approach to this problem is to build a model that predicts a likelihood, or match score: do a given pending transaction and a given posted transaction represent the same underlying transaction? If a pending transaction disappears from one account snapshot to the next, we match it with the “most likely” posted transaction that appears in the new snapshot. Matching continues greedily for as long as the match score is above a certain threshold.
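The greedy matching step described above can be sketched as follows. This is a minimal illustration, not Plaid's implementation: the function name `greedy_match`, the toy transaction identifiers, and the scores are all assumptions made for the example, and the `score` callable stands in for whatever model produces the match score.

```python
from typing import Callable, Dict, Hashable, List, Tuple


def greedy_match(
    pending: List[Hashable],
    posted: List[Hashable],
    score: Callable[[Hashable, Hashable], float],
    threshold: float,
) -> Dict[Hashable, Hashable]:
    """Greedily pair pending with posted transactions, best score first."""
    # Score every candidate pair once, then visit the best pairs first.
    candidates: List[Tuple[float, int, int]] = [
        (score(p, q), i, j)
        for i, p in enumerate(pending)
        for j, q in enumerate(posted)
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)

    matches: Dict[Hashable, Hashable] = {}
    used_pending, used_posted = set(), set()
    for s, i, j in candidates:
        if s < threshold:
            break  # all remaining pairs score lower still
        if i in used_pending or j in used_posted:
            continue  # each transaction is matched at most once
        matches[pending[i]] = posted[j]
        used_pending.add(i)
        used_posted.add(j)
    return matches


# Toy scores (hypothetical); a real system would call a trained model here.
toy_scores = {
    ("pend_cafe", "post_cafe"): 0.92,
    ("pend_cafe", "post_hotel"): 0.10,
    ("pend_hotel", "post_cafe"): 0.15,
    ("pend_hotel", "post_hotel"): 0.81,
    ("pend_gas", "post_cafe"): 0.30,
    ("pend_gas", "post_hotel"): 0.20,
}
result = greedy_match(
    ["pend_cafe", "pend_hotel", "pend_gas"],
    ["post_cafe", "post_hotel"],
    lambda p, q: toy_scores[(p, q)],
    threshold=0.5,
)
```

In this toy run, `pend_gas` stays unmatched because none of its candidate pairs clears the threshold.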
The crux of the matter is choosing a model to determine this match score.
Decision tree algorithms
We initially considered rules that would tell us how well a particular pending transaction and posted transaction match. Here is a visual representation of an example rule that matches pending and posted transactions initiated by restaurants:
This rule-based approach, called a decision tree, segments the space of independent variables, such as information about the transactions, and attempts to find regions of this space that correspond to matching transactions. While the decision tree in the visualization above outputs a Boolean prediction, more powerful machine learning models, including ours, often use decision trees that output likelihoods instead.
Although algorithms exist for training decision trees, standalone trees are rarely used in practice. This is because they tend to learn the noise in the training data rather than the underlying relationships in the data. For example, a decision tree may wrongly conclude that transaction descriptions are irrelevant because many transactions that do not match each other have similar descriptions. This problem is called overfitting.
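A rule of this kind can be written as a chain of branching conditions. The sketch below is purely hypothetical: the `Txn` fields, the function name `restaurant_rule`, and every threshold (five days, a 40% allowance for tips) are invented for illustration and are not taken from Plaid's rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Txn:
    description: str
    amount: float
    day: date


def restaurant_rule(pending: Txn, posted: Txn) -> bool:
    """Hypothetical hand-written rule; every threshold is illustrative."""
    # Branch 1: descriptions must agree (case-insensitive, exact).
    if pending.description.lower() != posted.description.lower():
        return False
    # Branch 2: the posted transaction should appear within a few days.
    if not 0 <= (posted.day - pending.day).days <= 5:
        return False
    # Branch 3: a tip can raise the amount, but only by so much.
    return pending.amount <= posted.amount <= pending.amount * 1.4
```

Each `if` corresponds to one split in the tree, and each `return` to a leaf. A tree that outputs a likelihood would return a number between 0 and 1 at each leaf instead of a Boolean.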
Too much model complexity leads to overfitting because it allows the model to warp to the training data.
Overfitting is referred to as “high variance” because an overfitted model depends strongly on its training data: small changes in the training data result in large changes in the predictions.
On the other hand, too few variables and insufficient model complexity lead to underfitting, where the model is too inflexible to find meaningful relationships in the training data. Underfitting is referred to as “high bias” because an underfit model has significant systematic prediction errors, or biases.
A fundamental challenge in data science is the bias-variance trade-off. Increasing model complexity tends to lead to higher variance and lower bias. If our models are optimized solely on a measure of bias (such as accuracy on the training set), they will tend to overfit.
To solve the pending-to-posted matching problem without overfitting, our first model augments the notion of decision trees using bagging and feature sampling. Let’s discuss bagging first, which is short for bootstrap aggregating.
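For squared-error loss, the trade-off can be written down exactly. The decomposition below is a standard textbook result rather than something specific to our models. It assumes $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ denotes a model trained on a randomly drawn training set, with the expectation taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

A more complex model typically shrinks the first term and grows the second, which is why optimizing training-set fit alone pushes toward overfitting.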
“Bootstrapping” is the process of training a model on a random sample of the training data, typically drawn with replacement. Because each model sees only a sample of the data, and therefore a different distribution of noise, bootstrapping combats overfitting.
“Aggregating” is the process of combining many different bootstrapped models. For bootstrapped trees, the aggregation process typically has the trees “vote” by averaging the likelihoods that the individual trees predict. Since the training subsets are randomly sampled, the decision trees still fit the dataset on average, but the vote gives more robust predictions.
Combining bootstrapping and aggregating yields bagging.
Bagging reduces variance more when the component models are uncorrelated. However, bootstrapping only on different samples of the training data often results in trees with highly correlated predictions, since the most informative branching rules are often similar across the sampled training data. For example, since transaction descriptions are a strong indicator of whether pending and posted transactions match, most of our trees will rely heavily on this feature. In this case, bagging has limited power to reduce the variance of the overall model.
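To make the two steps concrete, here is a minimal sketch of bagging in pure Python. It is an illustration under simplifying assumptions, not our production model: each "tree" is a one-split stump on a single numeric feature, and the names `fit_stump` and `bagged_model` are invented for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, int]           # (feature value, label in {0, 1})
Model = Callable[[float], float]  # maps a feature value to a likelihood


def fit_stump(rows: Sequence[Row]) -> Model:
    """Fit a one-split tree by choosing the split with lowest squared error."""
    def rate(labels: List[int]) -> float:
        # Fraction of positives on one side; 0.5 if the side is empty.
        return sum(labels) / len(labels) if labels else 0.5

    best = None
    for t in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x < t]
        right = [y for x, y in rows if x >= t]
        p_left, p_right = rate(left), rate(right)
        err = sum((y - p_left) ** 2 for y in left)
        err += sum((y - p_right) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, p_left, p_right)
    _, t, p_left, p_right = best
    return lambda x: p_left if x < t else p_right


def bagged_model(rows: Sequence[Row], n_models: int, rng: random.Random) -> Model:
    """Bootstrap: fit each stump on a resample. Aggregate: average the votes."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]  # draw with replacement
        models.append(fit_stump(sample))
    return lambda x: sum(m(x) for m in models) / len(models)
```

Calling `bagged_model(rows, 25, random.Random(0))` returns a function whose output is the average likelihood across 25 stumps, each trained on its own resample of `rows`.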
This is where feature sampling comes in.
To reduce tree correlation, our model randomly samples features in addition to randomly sampling training data, resulting in a random forest. A staple of a data scientist’s toolkit, random forests are powerful predictors with low risk of overfitting, high performance, and high ease of use.
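The sampling that distinguishes a random forest from plain bagging can be sketched in a few lines. This is a simplified illustration: the function name `sample_for_tree` and the feature names in the usage note are hypothetical, and the sketch draws one feature subset per tree, whereas common random forest implementations redraw the candidate features at every split.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


def sample_for_tree(
    rows: Sequence[Row],
    feature_names: Sequence[str],
    n_features: int,
    rng: random.Random,
) -> Tuple[List[Row], List[str]]:
    """Draw the training view for one tree: resampled rows, subset of features."""
    # Bagging step: resample rows with replacement.
    boot_rows = [rng.choice(rows) for _ in rows]
    # Feature-sampling step: keep a random subset of the columns.
    features = rng.sample(list(feature_names), n_features)
    view = [
        {**{f: row[f] for f in features}, "label": row["label"]}
        for row in boot_rows
    ]
    return view, features
```

With hypothetical features such as `description_similarity`, `amount_ratio`, and `days_apart`, some trees never see `description_similarity` at all, which forces them to find signal elsewhere and makes their predictions less correlated with the rest of the forest.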
This is the model Plaid used for years to match pending and posted transactions. Over time, this approach proved effective but not ideal: when we evaluated the model on human-labeled data, we noticed a high false negative rate. We needed to improve the model to find matches more reliably.
When random forests fail
Random forests, and bagging in general, are susceptible to imbalanced datasets. We ran into this problem with our random forest model for pending-to-posted matching. Since each pending transaction has at most one posted transaction in the training set, most candidate pairs of pending and posted transactions do not match. This means that our training set is imbalanced, with the vast majority of the data labeled “doesn’t match”. As a result, our random forest model predicted match probabilities that were too low, leading to a higher false negative rate.
To solve this problem, we use boosting. Boosting restricts decision trees to simple forms (that is, trees that are not very deep) to reduce bias in the overall model. The boosting algorithm works through the training data iteratively, each time adding the restricted tree that most improves the aggregate model. As with bagging, the vote of the trees determines the final decision.
Through this process, the algorithm eventually learns that improving performance on the minority case, matching transaction pairs, yields the greatest model improvement. It then homes in on the conditions that identify this case, and with carefully tuned hyperparameters, we finally saw a significant improvement in our false negative rate.
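A quick calculation shows how severe this imbalance can be. The sketch below assumes, for illustration only, that every pending transaction in an account is paired with every posted transaction as a candidate; the function name and the counts in the usage note are hypothetical.

```python
def max_positive_fraction(n_pending: int, n_posted: int) -> float:
    """Upper bound on the share of candidate pairs that are true matches."""
    # Every pending transaction is paired with every posted transaction.
    candidate_pairs = n_pending * n_posted
    # Each pending transaction matches at most one posted transaction.
    max_matches = min(n_pending, n_posted)
    return max_matches / candidate_pairs
```

For an account with 20 pending and 50 posted transactions, there are 1,000 candidate pairs and at most 20 matches, so at least 98% of the training examples are labeled “doesn’t match”.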
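The iterative procedure can be illustrated with a small gradient-boosting sketch. We do not name a specific boosting algorithm in this post, so treat this as one generic variant under stated assumptions: squared-error loss, depth-1 trees on a single feature, and invented names (`fit_regression_stump`, `boost`, `learning_rate`). Classification models more commonly boost on a log-loss objective.

```python
from typing import Callable, List, Sequence


def fit_regression_stump(
    xs: Sequence[float], residuals: Sequence[float]
) -> Callable[[float], float]:
    """Depth-1 tree: one threshold, one constant prediction per side."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right)  # never empty: t itself is in xs
        err = sum((r - lv) ** 2 for r in left)
        err += sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x < t else rv


def boost(
    xs: Sequence[float],
    ys: Sequence[float],
    n_rounds: int,
    learning_rate: float = 0.5,
) -> Callable[[float], float]:
    """Add one restricted tree per round, each fit to the current errors."""
    base = sum(ys) / len(ys)
    stumps: List[Callable[[float], float]] = []

    def predict(x: float) -> float:
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # Residuals are what the aggregate model still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_regression_stump(xs, residuals))
    return predict
```

Each round fits a new stump to the residuals of the current aggregate, so later trees concentrate on the examples the earlier trees handle worst. In an imbalanced dataset, those are often the rare positive cases.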
Another advantage of boosting is the flexibility to define the measure of “model improvement” used during training. By assigning asymmetric penalties to false positives and false negatives, we train a model that better reflects the asymmetric ways these errors affect consumers.
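One common way to express such asymmetric penalties is a weighted log loss. The sketch below is an assumption-laden illustration rather than our actual objective: the function name and the default weights `fn_weight` and `fp_weight` are hypothetical, and the values carry no meaning beyond the example.

```python
import math
from typing import Sequence


def asymmetric_log_loss(
    y_true: Sequence[int],
    p_pred: Sequence[float],
    fn_weight: float = 5.0,  # hypothetical penalty for missing a true match
    fp_weight: float = 1.0,  # hypothetical penalty for a spurious match
) -> float:
    """Mean log loss with separate weights for the two kinds of error."""
    eps = 1e-12  # keeps log() finite at predictions of exactly 0 or 1
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        if y == 1:
            total += -fn_weight * math.log(p)      # low p on a match
        else:
            total += -fp_weight * math.log(1 - p)  # high p on a non-match
    return total / len(y_true)
```

With these weights, predicting 0.1 for a true match costs five times as much as predicting 0.9 for a non-match, so a model trained against this loss is pushed to avoid false negatives first.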
Compared to the random forest model, our new boosted model reduces the false negative rate by 96%, ultimately providing our clients and consumers with higher-quality transaction data. In addition to the improvement in internal metrics, we have also seen a significant reduction in client-submitted issues regarding pending-to-posted transaction matching.
It is essential to understand how the characteristics of different types of machine learning models lead to different advantages and disadvantages. While our new model significantly improves the quality of the data we provide, it comes with its own tradeoffs. Boosting is sensitive to the model improvement metric and to the other hyperparameters that constrain the trees to be simple. In this case, an improved consumer experience is well worth a careful training and fine-tuning process.
There’s more we haven’t explored yet. For example, given the large number of categorical variables we use, which boosting algorithm is the best? Given that one in four Americans uses a bank account to process transactions multiple times a day, how do we ensure our matching algorithms are fast enough to keep up? Given the difficulty of interpreting and explaining the reasoning behind their outputs, are deep neural networks worth the investment for this problem?