Qbeast Format can improve Fraud Detection

One common problem that we hear about is the scalability of ML models. Even common models such as support vector machines, linear regression, or neural networks can become very slow when they are trained at large scale. Luckily, we can use sampling as a trick to increase speed without giving up much accuracy. This is what the Qbeast Format can help us with, and we’ll see how to take advantage of its sampling capability.

One of the uses of machine learning algorithms is detecting fraudulent transactions by feeding a model with historical transaction data.

In this case we used the Fraudulent Transactions Prediction dataset, which is provided in CSV format and can be loaded directly with Spark. But first, let’s get a general idea of the data structure and properties.


This can be easily done using pandas:
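A minimal sketch of such a first look is shown here. The three rows are made up to mimic the column layout of the dataset (they are not real records), and `summarize` is a hypothetical helper name; with the real file you would pass its path to `read_csv` instead of the in-memory string.

```python
import io

import pandas as pd

# Made-up rows that only mimic the dataset's column layout (not real data).
SAMPLE_CSV = """step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1,PAYMENT,9839.64,C0000000001,170136.0,160296.36,M0000000001,0.0,0.0,0,0
1,TRANSFER,181.0,C0000000002,181.0,0.0,C0000000003,0.0,0.0,1,0
2,CASH_OUT,229133.94,C0000000004,15325.0,0.0,C0000000005,5083.0,51513.44,0,0
"""


def summarize(df):
    """Collect column types, numeric ranges and class counts for a quick look."""
    numeric = df.select_dtypes(include="number")
    return {
        "dtypes": df.dtypes.astype(str).to_dict(),
        "ranges": {c: (numeric[c].min(), numeric[c].max()) for c in numeric.columns},
        "fraud_counts": df["isFraud"].value_counts().to_dict(),
    }


# Replace the StringIO object with the path to the real CSV file.
whole_dataset = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(whole_dataset)
print(summary["ranges"]["amount"], summary["fraud_counts"])
```

The numeric ranges show which columns need scaling, and the class counts give a first hint of how imbalanced the labels are.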

The ‘type’ column is categorical, so it needs to be encoded. The ‘nameOrig’ and ‘nameDest’ columns are strings that in this case provide little information, so it’s better to remove them. The ‘amount’ and balance columns have a clearly wider range than the other columns, so they need to be scaled.

It is clear that some preprocessing is needed:

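from sklearn.preprocessing import LabelEncoder, StandardScaler

# Removing the string columns
whole_dataset = whole_dataset.drop(columns=["nameOrig", "nameDest"])
# Removing columns to preprocess: only the remaining numeric features are scaled.
# (Which columns stay unscaled is an assumption based on the integer casts used later.)
unscaled = ["step", "type", "isFlaggedFraud"]
columns = [col for col in whole_dataset.columns if col not in unscaled + ["isFraud"]]

#Encoding of categorical features
le = LabelEncoder()
whole_dataset["type"] = le.fit_transform(whole_dataset["type"])

#Scaling the numeric features
scaler = StandardScaler()
whole_dataset[columns] = scaler.fit_transform(whole_dataset[columns])
#Add unscaled features again
columns = columns + unscaled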

To be able to test our algorithm properly, we’ll need a train set and a test set:

#Train/test splitting
from sklearn.model_selection import train_test_split

Xtrain, Xtestwhole, Ytrain, Ytestwhole = train_test_split(whole_dataset[columns], whole_dataset["isFraud"])

Now we can prepare the dataset for Spark and rewrite it in Parquet format:

import pandas

int_columns = {'type':'int', 'isFlaggedFraud':'int', 'isFraud':'int', 'step':'int'}
train_dataset = pandas.concat([Xtrain, Ytrain], axis=1).astype(int_columns)
test_dataset = pandas.concat([Xtestwhole, Ytestwhole], axis=1).astype(int_columns)
# Write both sets as Parquet (the two path variables are placeholders)
train_dataset.to_parquet(train_parquet_path)
test_dataset.to_parquet(test_parquet_path)


Loading the dataset with pyspark:

# Parquet files carry their own schema, so the CSV-style header/inferSchema options are not needed
train_data = spark.read.format("parquet").load(train_parquet_path)  # train_parquet_path is a placeholder

Once we have it loaded and split into the train and test sets, we can rewrite it in the Qbeast format:

train_data.write.mode("overwrite").format("qbeast") \
    .option("columnsToIndex", "step,type,amount,isFraud").option("cubeSize", "25000").save(table_path)

By indexing the data on those four columns, we ensure that our samples are randomly distributed along each of those four dimensions, preserving the distribution of the whole dataset.
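Reading a sample back could look like the sketch below. The helper name matches the `read_sample_pandas` call used in the experiment loop later on, but its body, the optional `session` argument and the fraction check are assumptions, since the original helper is not shown. It also assumes a Spark session that already has the Qbeast data source configured.

```python
def read_sample_pandas(table_path, fraction, session=None):
    """Load roughly `fraction` of a Qbeast table as a pandas DataFrame.

    Hypothetical reconstruction: the original helper is not shown in the post.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    if session is None:
        # Imported lazily so the helper can be exercised without Spark installed.
        from pyspark.sql import SparkSession

        session = SparkSession.builder.getOrCreate()
    qbeast_df = session.read.format("qbeast").load(table_path)
    # Sampling a DataFrame read from a Qbeast table lets the format skip data
    # instead of scanning the whole table first.
    return qbeast_df.sample(fraction=fraction).toPandas()


# Usage (requires Spark with the Qbeast data source available):
# train = read_sample_pandas(table_path, 0.2)
```

Passing the session explicitly is only a convenience for testing; in a notebook the default session is usually enough.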


We wanted to check how much the training of a random forest classifier could be sped up by using the Qbeast sampling mechanism.

We performed the same experiment for different sampling percentages. For each percentage, we trained a random forest using only the retrieved sample, repeating the run several times to smooth out random fluctuations and resetting the random seed for each percentage.
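The repeat-and-average protocol can be sketched independently of any ML library. Everything in this block is illustrative: `repeat_timed` and `noisy_score` are hypothetical names, and the stand-in workload simply returns a noisy number instead of training a model.

```python
import random
import statistics
import time


def repeat_timed(run_once, fractions, repeats=10, seed=0):
    """Run `run_once(fraction)` several times per fraction and average the results."""
    results = {}
    for fraction in fractions:
        random.seed(seed)  # reset the seed for every sampling fraction
        timings, scores = [], []
        for _ in range(repeats):
            start = time.perf_counter()
            scores.append(run_once(fraction))
            timings.append(time.perf_counter() - start)
        results[fraction] = {
            "mean_time": statistics.mean(timings),
            "mean_score": statistics.mean(scores),
        }
    return results


def noisy_score(fraction):
    """Stand-in for 'train on a sample and score it': the fraction plus small noise."""
    return fraction + random.uniform(-0.01, 0.01)


summary = repeat_timed(noisy_score, fractions=[0.1, 0.5, 1.0], repeats=5)
for fraction, stats in summary.items():
    print(fraction, round(stats["mean_score"], 3))
```

Resetting the seed inside the outer loop keeps the runs for one percentage reproducible regardless of which percentages were evaluated before it.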

We used BSC’s dislib library for our experiments. If you are not familiar with it, the sklearn library should provide similar results.

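import math
import time

import dislib as ds
import pandas
# Import path for recent dislib releases; older releases expose it under dislib.classification
from dislib.trees import RandomForestClassifier

# precisions, table_path, test_data, read_sample_pandas and the result
# dictionaries are assumed to be defined earlier in the script
for p in precisions:  # p is the sampling fraction
    print("Computing with precision = " + str(p))
    # Preprocessing
    rt0 = time.time()
    train = read_sample_pandas(table_path, p)
    rt1 = time.time()
    read_time_dict[p] = rt1 - rt0
    train_objs_num = len(train)
    # One-hot encode train and test together so both get the same dummy columns
    dataset = pandas.concat(objs=[train, test_data], axis=0)
    dataset_preprocessed = pandas.get_dummies(dataset, columns=["type"])
    train = dataset_preprocessed[:train_objs_num]
    test = dataset_preprocessed[train_objs_num:]
    # Features are every column except the label
    columns = [col for col in test.columns if col != "isFraud"]
    Xtrain, Ytrain = train[columns].to_numpy(), train["isFraud"].to_numpy()
    Xtestwhole, Ytestwhole = test[columns].to_numpy(), test["isFraud"].to_numpy()
    # Random forest
    prediction_dict[p] = []
    for i in range(10):
        start = time.time()
        rf = RandomForestClassifier()
        blk_size = (int(math.ceil(Xtrain.shape[0] / 10)), Xtrain.shape[1])
        Ytrain = Ytrain.reshape(len(Ytrain), 1)
        rf.fit(ds.array(Xtrain, block_size=blk_size),
               ds.array(Ytrain, block_size=(blk_size[0], 1)))
        pred = rf.predict(ds.array(Xtestwhole, block_size=blk_size))
        end = time.time()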


We also measured the reading and training times for a range of sampling percentages:

As expected, the execution time increases as the training is performed with more data.

Regarding accuracy, even a simple random forest achieves a very high value because the classes in this dataset are highly imbalanced: classifying every sample as not fraudulent would already give a high accuracy.

In this case the F1 score is more appropriate than accuracy because it takes into account not only precision but also recall. False negatives, which we want to avoid so that no fraudulent transaction is taken as a legitimate one, therefore have more impact on this metric.
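For reference, these are the standard definitions of the metrics involved, written with the abbreviations listed right after the formulas:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
$$

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

Note that TN does not appear in the F1 score, which is why a large number of correctly classified legitimate transactions cannot inflate it.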

(TP=True Positives, FP=False Positives, TN=True Negatives, FN=False Negatives)
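To make the contrast between accuracy and F1 concrete, here is a small sketch in plain Python. The labels are made up (10 frauds among 1,000 transactions, an illustrative ratio rather than the one in the dataset), and the "classifier" simply labels everything as legitimate.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    # By convention F1 is 0 when there are no positives at all.
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Made-up, highly imbalanced labels: 10 frauds among 1,000 transactions.
y_true = [1] * 10 + [0] * 990
always_legit = [0] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, always_legit)
baseline_accuracy = accuracy(tp, fp, tn, fn)
baseline_f1 = f1_score(tp, fp, fn)
print(baseline_accuracy, baseline_f1)
```

The do-nothing baseline reaches 99% accuracy while its F1 score is zero, because it never finds a single fraudulent transaction.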

The results for the F1 score are as follows:


The results are good even when retrieving a small percentage of the whole data. With 40% of the data, the F1 score is almost identical to the ‘real’ F1 obtained with the full dataset, while reading and training take around 20% of the time needed for the whole dataset. Even with 20% of the data, we get good results while spending 10x less time on the training process. This suggests that the samples are representative: the distribution of the data is preserved, so a model trained on a sample performs close to one trained on all the data.

Want to know more?

Book a call with Paola to learn how we can help your company: https://calendly.com/paolapardo/30min