Example of training time reduction for a classifier

May 10, 2022 Shoichiro Yokotani, Application Development Expert, AI Platform Division

In machine learning, supervised learning algorithms can be categorized into two types: regression and classification. In this article, we take the latter, classification, as an example and run a sample using Frovedis learning algorithms. We also compare the training time of the Frovedis and scikit-learn versions.

Classifiers are applied to datasets in which the output y takes discrete values for many input variables. For example, the Credit Card Fraud Detection dataset used in this study has 29 features and a binary output y: Not Fraud or Fraud. We will use this dataset to make a two-class decision with machine learning algorithms.

Typical machine learning algorithms for classification include logistic regression, linear support vector machines, classification trees, random forests (an ensemble method built from classification trees), and gradient boosting classification trees. In this column, we will focus on two-class classification using classification trees and gradient boosting classification trees.

Gradient boosting decision trees can also be used for classification. The scikit-learn version of gradient boosting decision trees does not perform parallel processing, whereas the Frovedis version builds each decision tree in parallel, so it is expected to reduce training time compared to scikit-learn on very large datasets.
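
Frovedis mirrors scikit-learn's estimator API, so switching libraries generally amounts to changing the import and starting the Frovedis server. A minimal sketch of the pattern used throughout this article (assuming the FROVEDIS_SERVER environment variable points to the server binary, as in the cells below):

import os
from frovedis.exrpc.server import FrovedisServer
# Frovedis drop-in counterpart of sklearn.ensemble.GradientBoostingClassifier
from frovedis.mllib.ensemble import GradientBoostingClassifier

# Launch the Frovedis server with 6 MPI processes
FrovedisServer.initialize("mpirun -np 6 {}".format(os.environ["FROVEDIS_SERVER"]))

clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01)
# clf.fit(X_train, y_train) and clf.predict(X_test) work as in scikit-learn
FrovedisServer.shut_down()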

Supervised Learning: A Learning Algorithm for Two-Class Classification
Two-class classification using classification trees and gradient boosting classification trees (training time comparison between scikit-learn version and Frovedis version)
Dataset: Credit Card Fraud Detection https://www.kaggle.com/mlg-ulb/creditcardfraud
Loading the dataset

in [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('../../data/classify/creditcard.csv')
class_names = {0:'Not Fraud', 1:'Fraud'}
print(df.Class.value_counts().rename(index = class_names))

data_features = df.drop(['Time', 'Class'], axis=1).values
data_target = df['Class'].values

Not Fraud 284315
Fraud 492
Name: Class, dtype: int64


in [2]:
df.drop(['Time', 'Class'], axis=1).head()
out [2]:
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ...
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 0.090794 ...
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.166974 ...
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.207643 ...
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 -0.054952 ...
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941 -0.270533 0.817739 0.753074 ...
  V20 V21 V22 V23 V24 V25 V26 V27 V28 Amount
0 0.251412 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62
1 -0.069083 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69
2 0.524980 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66
3 -0.208038 -0.108300 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50
4 0.408542 -0.009431 0.798278 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 69.99
5 rows × 29 columns
Partitioning the data set for training and inference
in [3]:
from sklearn.model_selection import train_test_split
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target, train_size=0.70, test_size=0.30, random_state=1)
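
Note that the dataset is highly imbalanced (492 Fraud cases out of 284,807 records). The split above relies on random shuffling only; a stratified variant (an alternative, not used in this notebook) would preserve the fraud ratio in both subsets:

# Alternative: stratified split keeps the ~0.17% fraud ratio in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data_features, data_target, train_size=0.70, test_size=0.30,
    random_state=1, stratify=data_target)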

Learning and inference with the Frovedis version of Classification Trees
in [4]:
import os, time
from frovedis.exrpc.server import FrovedisServer
from frovedis.mllib.tree import DecisionTreeClassifier as frovDecisionTreeClassifier
FrovedisServer.initialize("mpirun -np 6 {}".format(os.environ['FROVEDIS_SERVER']))

fdtc = frovDecisionTreeClassifier(max_depth=8)
t1 = time.time()
fdtc.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 0.262 sec

Displaying inference results
in [5]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score, recall_score
pred = fdtc.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# sklearn layout for binary labels [0, 1]: [[TN, FP], [FN, TP]]
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: ' + str(np.round(100*float(tpos+tneg)/float(tpos+tneg+fpos+fneg), 2)) + '%')
print("Recall : {}".format(recallScore))
print("F1 Score : {}".format(f1Score))
FrovedisServer.shut_down()
confusion matrix:
[[85292 16]
[ 34 101]]
Accuracy: 99.94%
Recall : 0.75
F1 Score : 0.8
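
The confusion-matrix bookkeeping above is repeated verbatim for every model below. A small helper function (hypothetical, not part of the original notebook) could factor it out; a minimal sketch using the metrics already imported in cell [5]:

def report_metrics(y_true, y_pred):
    # sklearn layout for binary labels [0, 1]: [[TN, FP], [FN, TP]]
    cmat = confusion_matrix(y_true, y_pred)
    tneg, fpos, fneg, tpos = cmat.ravel()
    print('confusion matrix:')
    print(cmat)
    print('Accuracy: {:.2f}%'.format(100.0 * (tpos + tneg) / cmat.sum()))
    print('Recall : {}'.format(round(recall_score(y_true, y_pred), 2)))
    print('F1 Score : {}'.format(round(f1_score(y_true, y_pred), 2)))

# usage: report_metrics(y_test, pred)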

Learning with scikit-learn version of classification tree
in [6]:
import os, time
from sklearn.tree import DecisionTreeClassifier as skDecisionTreeClassifier

sdtc = skDecisionTreeClassifier(max_depth=8)
t1 = time.time()
sdtc.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 7.809 sec

Displaying inference results
in [7]:
pred = sdtc.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# sklearn layout for binary labels [0, 1]: [[TN, FP], [FN, TP]]
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: ' + str(np.round(100*float(tpos+tneg)/float(tpos+tneg+fpos+fneg), 2)) + '%')
print("Recall : {}".format(recallScore))
print("F1 Score : {}".format(f1Score))
confusion matrix:
[[85295 13]
[ 37 98]]
Accuracy: 99.94%
Recall : 0.73
F1 Score : 0.8

Training and inference with Frovedis version of gradient boosting classification tree
in [8]:
from frovedis.mllib.ensemble import GradientBoostingClassifier
FrovedisServer.initialize("mpirun -np 6 " + os.environ["FROVEDIS_SERVER"])

fgb = GradientBoostingClassifier(n_estimators=500, learning_rate= 0.01)
t1 = time.time()
fgb.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 16.839 sec

Displaying inference results
in [9]:
pred = fgb.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# sklearn layout for binary labels [0, 1]: [[TN, FP], [FN, TP]]
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: ' + str(np.round(100*float(tpos+tneg)/float(tpos+tneg+fpos+fneg), 2)) + '%')
print("Recall : {}".format(recallScore))
print("F1 Score : {}".format(f1Score))

FrovedisServer.shut_down()
confusion matrix:
[[85300 8]
[ 35 100]]
Accuracy: 99.95%
Recall : 0.74
F1 Score : 0.82

Learning and inference with scikit-learn's version of gradient boosting classification trees
in [10]:
from sklearn.ensemble import GradientBoostingClassifier
sgb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01)
t1 = time.time()
sgb.fit(X_train, y_train)
t2 = time.time()
print ("train time: {:.3f} sec".format(t2-t1))
train time: 1398.770 sec

Displaying inference results
in [11]:
pred = sgb.predict(X_test)
cmat = confusion_matrix(y_test, pred)
# sklearn layout for binary labels [0, 1]: [[TN, FP], [FN, TP]]
tneg = cmat[0][0]
tpos = cmat[1][1]
fpos = cmat[0][1]
fneg = cmat[1][0]
f1Score = round(f1_score(y_test, pred), 2)
recallScore = round(recall_score(y_test, pred), 2)
print('confusion matrix:')
print(cmat)
print('Accuracy: ' + str(np.round(100*float(tpos+tneg)/float(tpos+tneg+fpos+fneg), 2)) + '%')
print("Recall : {}".format(recallScore))
print("F1 Score : {}".format(f1Score))
confusion matrix:
[[85297 11]
[ 37 98]]
Accuracy: 99.94%
Recall : 0.73
F1 Score : 0.8

Visualizing inference results of the Frovedis gradient boosting classification tree with PCA dimensionality reduction
Here the test features are standardized and reduced to two principal components purely for plotting; the classifier itself is still trained and evaluated on the original 29 features.
in [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_test)
X_scaled = scaler.transform(X_test)

from frovedis.mllib.ensemble import GradientBoostingClassifier
from frovedis.decomposition import PCA
FrovedisServer.initialize("mpirun -np 6 " + os.environ["FROVEDIS_SERVER"])
fgb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01)
fgb.fit(X_train, y_train)

pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
data = {'PCA1': X_pca[:,0], 'PCA2': X_pca[:,1], 'Target': y_test, 'Test': fgb.predict(X_test)}
df = pd.DataFrame(data)
df.head()
out [12]:
  PCA1 PCA2 Target Test
0 0.587710 -0.102335 0 0
1 0.510780 -0.442828 0 0
2 0.466657 -0.192914 0 0
3 0.354354 0.037564 0 0
4 0.471092 -0.166892 0 0

in [13]:
import matplotlib.pyplot as plt
df_0 = df[(df.Target==0) & (df.Test==0)]
df_1 = df[(df.Target==1) & (df.Test==1)]
df_test_0 = df[(df.Target==1) & (df.Test==0)]
df_test_1 = df[(df.Target==0) & (df.Test==1)]
plt.figure(figsize=(10,7))
plt.scatter(df_0['PCA1'], df_0['PCA2'], color='r', s=2, label='Not Fraud')
plt.scatter(df_1['PCA1'], df_1['PCA2'], color='g', s=2, label='Fraud')
plt.scatter(df_test_0['PCA1'], df_test_0['PCA2'], color='b', marker='x', s=100, label='False: Fraud->Not Fraud')
plt.scatter(df_test_1['PCA1'], df_test_1['PCA2'], color='y', marker='x', s=100, label='False: Not Fraud->Fraud')
plt.xlabel("PC1")
plt.ylabel('PC2')
plt.legend()
plt.show()

in [14]:
FrovedisServer.shut_down()
In these samples using the Credit Card Fraud Detection dataset, we first ran the classification tree in its Frovedis and scikit-learn versions, and then repeated the analysis with the gradient boosting classification tree. Finally, the classification results were plotted after PCA feature reduction. The training times of the Frovedis and scikit-learn learning algorithms are compared in the table below.

Learning algorithm     Frovedis (sec)   scikit-learn (sec)   Ratio
Classification tree    0.26             7.81                 x30.0
Gradient boosting      16.84            1398.77              x83.1

Optimizing and cross-validating models on large datasets requires repeated training runs with varying machine learning parameters, which can be very time consuming. Where data with new characteristics are frequently added, this time-consuming re-training must be repeated again and again.
By using Frovedis' parallelized algorithms on SX-Aurora TSUBASA, high-performance training models can be prepared frequently and quickly, reducing the cost of system development and maintenance.